Update Airflow Astro deployment docs #3792
Conversation
Signed-off-by: Dmitry Sorokin <[email protected]>
This is a fantastic start @DimedS ⭐ I have left some comments inline, but in general I think it would be useful to add more context and explanations about why certain steps are needed. Also, a minor note on referencing Kedro class names: when talking about a class in Kedro, e.g. DataCatalog, the best practice is to use either the exact class name or the regular English name for it (e.g. data catalog), but not a hybrid like Data Catalog.
2. Create a new Kedro project using the `astro-airflow-iris` starter. You can use the default value in the project creation process:
1. To create a new Kedro project, select the `example=yes` option to include example code. Additionally, to implement custom logging, select `tools=log`. Proceed with the default project name, but feel free to add any other tools as desired:
I think this requires a bit more of an intro to explain why the user needs to select the example project and the custom logging. Perhaps instead of immediately starting with the steps, give a bit of an intro explaining what will happen in the subsequent steps and what the expected end result is, e.g. a Kedro spaceflights project with logging set up that can be run on Airflow.
* **Step 2.3**: Modify the `Dockerfile` to have the following content:
This step should produce a .py file called `new_kedro_project_dag.py` located at `dags/`.

### Deployment process with Astro Airflow
I would again add a bit of an introduction paragraph before diving into the steps to explain what we'll be doing in this part of the tutorial.
```shell
astro dev init
```

2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run our Kedro project there, we need to copy several items from the previous sections into it: the `/data` folder from Step 1, the `/conf` folder from Steps 2-4, the `.whl` file from Step 5, and the Airflow DAG from Step 6:
I think it would be worth explaining why these items need to be copied over, so that users will have a better understanding of how things work when they're running a different project than spaceflights on airflow.
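As a sketch of what this copy step can look like in practice (folder and file names assume the tutorial's `new-kedro-project` and `kedro-airflow-spaceflights` layout; the `mkdir`/`touch` lines only mock the artifacts produced in earlier steps so the snippet is self-contained):

```shell
# Mock the artifacts produced by the earlier tutorial steps so this
# snippet runs standalone; in the real workflow these already exist.
mkdir -p new-kedro-project/data new-kedro-project/conf new-kedro-project/dist new-kedro-project/dags
touch new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl
touch new-kedro-project/dags/new_kedro_project_dag.py
mkdir -p kedro-airflow-spaceflights/dags kedro-airflow-spaceflights/dist

# Copy everything the Kedro project needs into the Astro project folder,
# because only that folder is built into the Airflow container image.
cp -r new-kedro-project/data kedro-airflow-spaceflights/
cp -r new-kedro-project/conf kedro-airflow-spaceflights/
cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/
cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/
```

The key point is that the Airflow image is built from the Astro project folder alone, so anything the Kedro run needs at runtime (data, configuration, the packaged project, and the DAG) has to live inside it.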
Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
Thank you, @merelcht ! I agree and hope I've addressed your comments.
In this section, we'll start by setting up a new blank Airflow project using Astro. We'll then copy the files prepared in the previous section from our Kedro project. Next, we'll customize the Dockerfile to enhance logging capabilities and manage the installation of our Kedro package. Finally, we will run and explore the Airflow cluster.
1. [Initialise an Airflow project with Astro](https://docs.astronomer.io/astro/cli/develop-project) in a new folder outside of your Kedro project. Let's call it `kedro-airflow-spaceflights` |
I followed through the tutorial and got stuck at this step because I didn't have the `astro` CLI installed. It would be good to mention it needs to be installed.
@merelcht , we have a Prerequisites section where that is mentioned. Do you think that's not enough? Also, when you click the link and try to install the Astro CLI, it is mentioned there that you need Docker for that:
Prerequisites
To follow this tutorial, ensure you have the following:
The Astro CLI installed
Ah yes I read over the Astro CLI part. I think it won't hurt to mention again at this point in the tutorial you need the Astro CLI and Docker. I would personally also mention Docker explicitly in the prereqs.
Agreed, I've included an additional note about Docker and the Astro CLI at the start of section 2 and also mentioned Docker in the prerequisites.
```shell
cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/
```
3. Add a few lines to the `Dockerfile` to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro and to install the .whl file of our prepared Kedro project into the Airflow container: |
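For reference, the Dockerfile additions this step describes might look roughly like the following sketch (the wheel filename assumes the tutorial's `new_kedro_project` package; treat the exact lines as illustrative, not the final doc content):

```Dockerfile
# Point Kedro at the custom logging configuration copied into the image
ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"

# Install the packaged Kedro project into the Airflow container
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```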
You should specify which `Dockerfile` you mean, because at this point we've created both `new-kedro-project` and `kedro-airflow-spaceflights`.
Agree, fixed
```Dockerfile
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```
#### Step 3. Convert the Kedro pipeline into an Airflow DAG with `kedro airflow`
4. Navigate to the `kedro-airflow-spaceflights` folder and launch the local Airflow cluster with Astronomer:
I got stuck here as well, because I didn't have Docker installed. So again it would be good to give some guidance or mention beforehand that users need to have `astro` and `docker` installed.
Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Dmitry Sorokin <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
Did a more thorough review of the prose and the workflow. Fantastic job @DimedS ! Left a few comments
docs/source/deployment/airflow.md (Outdated)

# Apache Airflow
Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph). |
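As a toy illustration of why a DAG structure suits pipeline orchestration (plain standard-library Python, not Airflow or Kedro code; the task names are made up):

```python
from graphlib import TopologicalSorter

# A tiny pipeline expressed as a DAG: each task maps to the set of
# tasks it depends on.
pipeline = {
    "split_data": set(),
    "train_model": {"split_data"},
    "evaluate_model": {"train_model", "split_data"},
}

# Because the graph is acyclic, a valid execution order always exists,
# which is what lets an orchestrator schedule the tasks.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # "split_data" comes first, "evaluate_model" last
```

This is the same property Airflow relies on when it schedules the tasks of a DAG generated from a Kedro pipeline.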
Note for the future: we should have an entry for DAGs in our Glossary! https://docs.kedro.org/en/stable/resources/glossary.html
docs/source/deployment/airflow.md (Outdated)
```shell
astro dev init
```
2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run our Kedro project there, we need to copy several items from the previous section into it:
Suggested change:
2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run the Kedro project there, you need to copy several items from the previous section into it:
I'm wondering if there's a simpler workflow that doesn't involve copying files around. Asking because, during development, I understand that the user would want to update the configuration several times, and this involves regenerating the wheel and repeating the copying process every time.
Yes, that's a really good question. Previously, the manual described fully moving the Kedro project into the Astro project, causing them to share the same folder. This seems a bit risky to me because the two projects begin to share the same files, such as the README, requirements, and .gitignore. It also wasn't clear exactly what needed to be transferred from the Kedro project to enable it to run in Astro, so I clarified that. However, it makes sense to keep them together for the reasons you mentioned: it will be easier to work with, update Kedro, repackage, and rerun Astro from the same folder. But in this case, we should be cautious about managing these common files.
I believe it's best to keep the manual as it is to clearly illustrate how Kedro and Astro are integrated, enhancing a user's understanding. However, we could add a recommendation at the end: "Feel free to combine them in the same folder if your project requires frequent updates, DAG recreation, and repackaging." What do you think?
Sounds good!
Added
## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only. |
Yeah, I think that GiD was going to deprecate this plugin as the same could already be achieved (not as easily, unfortunately) with the official plugin. @Lasica
Let's continue the conversation in kedro-org/kedro-plugins#652 then, to see to what extent the official plugin can bridge the gap.
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
I believe I've addressed most of the comments. Could you please do a final check, @ankatiyar , @merelcht , @astrojuanlu ?
@sbrugman By any chance do you have a moment to give this a look?
Left a couple of nitpicks but other than that this looks good to me!
Left some small nitpicks as well, but other than that I'm very happy with these docs! Great work @DimedS 👍
Considering this is quite a substantial re-write, I'd also add it to the release notes.
LGTM, the default dataset factory is the only thing that needs to be updated!
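For context, the dataset factory mentioned here refers to Kedro's catalog factory-pattern syntax, where one `catalog.yml` entry with a `{placeholder}` pattern matches many dataset names. A hedged sketch of what such an entry can look like (the pattern name and filepath are illustrative, not the exact entry this review refers to):

```yaml
"{default_dataset}":
  type: pandas.CSVDataset
  filepath: data/01_raw/{default_dataset}.csv
```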
Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

## How to run a Kedro pipeline on Apache Airflow with Astronomer
For users who wish to deploy using Astronomer, this tutorial is clear. For users who have a different Airflow provider, it should be mentioned which steps are not relevant and that `kedro-airflow` is not specific to Astronomer.
In the logical flow (reading top to bottom), this information should be placed above the Astronomer-specific instructions.
Was following along with the tutorial and came across very minor things.
Co-authored-by: Merel Theisen <[email protected]> Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Description
This PR addresses issues 1, 2, 3, 5, and 6 from #605:
- The `kedro-airflow-k8s` plugin has been relocated to the end of the document because it is compatible only with Kedro versions earlier than 0.17.
- The `astro-airflow-iris` starter has been replaced with the standard `spaceflights-pandas` example pipeline.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist
- Updated the `RELEASE.md` file