
Adding airflow operator to submit and monitor Spark Apps/App Env on DPGDC clusters #37223

Closed
wants to merge 11 commits

Conversation

akashsriv07
Contributor

Adding airflow operator to create and monitor Spark Apps/App Env on DPGDC clusters

GDC extends Google Cloud’s infrastructure and services to customer edge locations and data centers. We intend to bring Dataproc as a managed service offering to GDC.

As part of this, we need to integrate Airflow with Dataproc on GDC API resources to trigger Apache Spark workloads. The integration needs to cover two distinct API paths: local execution through the KRM API and remote execution through an equivalent One Platform API. We're targeting the former in this PR, i.e. an Airflow operator leveraging the KRM APIs.

This PR contains two feature implementations:

  1. An Airflow operator that submits a SparkApplication via the KRM API, enabling customers to use Airflow to orchestrate Dataproc on GDC workloads locally.
  2. An Airflow operator that creates an ApplicationEnvironment via the KRM API. (A hypothetical usage sketch for both operators is shown below.)
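
To make the intent concrete, here is a minimal, hypothetical usage sketch. The operator class names introduced by this PR are not spelled out in this thread, so EmptyOperator stand-ins are used to keep the example runnable; the task IDs and the wiring are illustrative assumptions only.

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dpgdc_spark_krm_example",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    # Stand-ins for the two operators proposed in this PR; the real class
    # names and parameters may differ.
    create_app_env = EmptyOperator(task_id="create_application_environment")
    submit_spark_app = EmptyOperator(task_id="submit_spark_application")

    # Create the ApplicationEnvironment first, then submit the SparkApplication into it.
    create_app_env >> submit_spark_app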


@boring-cyborg boring-cyborg bot added area:providers area:system-tests provider:google Google (including GCP) related issues labels Feb 7, 2024

boring-cyborg bot commented Feb 7, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

@akashsriv07 akashsriv07 changed the title Adding airflow operator to create and monitor Spark Apps/App Env on DPGDC clusters Adding airflow operator to submit and monitor Spark Apps/App Env on DPGDC clusters Feb 7, 2024
Member

@hussein-awala hussein-awala left a comment


I didn't find anything about this in the GCP documentation, could you please add the documentation link?

Is the CRD dataprocgdc.cloud.google.com/v1alpha1 based on sparkoperator.k8s.io/v1beta2? I'm asking because we already have two operators for the spark-on-k8s-operator, so maybe we can use one of them as a superclass for your operator to avoid duplicating code and implementing everything from scratch.

@akashsriv07
Contributor Author

Hey Hussein,
Dataproc on GDC has not gone GA yet, and this is one of the critical features for that launch, so we don't have public documentation for it at the moment.
You can take a look at this blog post though: https://cloud.google.com/blog/products/infrastructure-modernization/google-distributed-cloud-new-ai-and-data-services

The CRD for DPGDC is completely different from sparkoperator.k8s.io/v1beta2, which stops us from leveraging the same operator/sensor.
Using KRM APIs is one of the ways to interact with a DPGDC cluster.
In the future, we plan to add operators similar to those in dataproc.py (e.g. DataprocSubmitSparkJobOperator), which will leverage Google's internal One Platform API mechanism.
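
For reference, this is a minimal sketch of what "submitting a SparkApplication via the KRM API" could look like using the plain Kubernetes Python client. Only the API group/version dataprocgdc.cloud.google.com/v1alpha1 comes from this thread; the resource plural and the spec fields below are assumptions for illustration, not the actual DPGDC schema.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

spark_app = {
    "apiVersion": "dataprocgdc.cloud.google.com/v1alpha1",
    "kind": "SparkApplication",
    "metadata": {"name": "example-spark-app", "namespace": "default"},
    "spec": {  # hypothetical fields, for illustration only
        "mainApplicationFile": "gs://my-bucket/app.py",
        "applicationEnvironmentRef": "example-app-env",
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="dataprocgdc.cloud.google.com",
    version="v1alpha1",
    namespace="default",
    plural="sparkapplications",  # assumed plural name of the CRD
    body=spark_app,
)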


@hussein-awala
Member

> Hey Hussein, Dataproc on GDC has not gone GA yet, and this is one of the critical features for that launch, so we don't have public documentation for it at the moment. You can take a look at this blog post though: https://cloud.google.com/blog/products/infrastructure-modernization/google-distributed-cloud-new-ai-and-data-services

It will be a bit complicated to review this PR without a doc; I will try to review the syntax and check whether our conventions are respected. (cc: @eladkal could you take a look?)

> The CRD for DPGDC is completely different from sparkoperator.k8s.io/v1beta2, which stops us from leveraging the same operator/sensor. Using KRM APIs is one of the ways to interact with a DPGDC cluster. In the future, we plan to add operators similar to those in dataproc.py (e.g. DataprocSubmitSparkJobOperator), which will leverage Google's internal One Platform API mechanism.

No problem with that, my goal was to reduce code duplication if possible, but since they are different, we can keep building your operator from scratch.

@molcay
Contributor

molcay commented Feb 23, 2024

Hi @akashsriv07,

I would like to propose a different approach for the folder/module structure. Since GDC might have different APIs or different inner working mechanisms, it could be good to separate the GDC-related things (hooks, operators, sensors) from the cloud ones. In the future, if you want to add more things (hooks, operators, sensors), the structure can be extended easily and cleanly. Also, you will still have access to the existing Google Cloud operators, sensors, etc.

Hence, I am suggesting creating a new module inside the google provider (at the same level as the cloud module). Something like the following:

airflow/providers/google
├── ads
├── cloud
├── common
├── firebase
├── gdc                         # <-- this is the new one
├── leveldb
├── marketing_platform
├── suite
├── CHANGELOG.rst
├── go_module_utils.py
├── __init__.py
└── provider.yaml

What do you think?

@akashsriv07
Contributor Author


Hey @molcay, GDC is part of Google Cloud only. Shouldn't it be a part of the cloud module?

@molcay
Contributor

molcay commented Feb 26, 2024

Hey @akashsriv07,

When I read the following sentence on the product page, I thought that this product is not directly part of GCP; rather, it is an equivalent of GCP for on-premise infrastructure:

> Build, deploy, and scale modern industry & public sector applications on-premise with AI ready modern infrastructure, ensure data security, and enable agile developer workflows across edge locations & data centers.

But of course, I am not entirely sure on this topic; the suggestion falls into a gray area, I guess.
On the other hand, at the base level, maybe we can group the changes related to GDC in a single module and put it inside the cloud module.

airflow/providers/google/cloud
├── example_dags
├── fs
├── gdc                         # <-- this is the new one
├── hooks
├── __init__.py
├── _internal_client
├── links
├── log
├── operators
├── __pycache__
├── secrets
├── sensors
├── transfers
├── triggers
└── utils

Also, maybe it is too early to settle on this module structure right now. We can also wait and see how it will evolve.

@michalmodras

Hey, Michal from Cloud Composer team here.
+1 for this code living in a separate GDC provider - GDC is based on slightly different infra than GCP, and the operators won't necessarily work the same way (some might be very similar, some very different). If there's shared code, I propose having a util module/provider between the two.
FWIW, we plan to separate out more Google providers (e.g. taking ads out into its own provider) to decouple support of Google services a bit, instead of having them all in one provider, which at times unnecessarily tangles things such as dependencies.

@akashsriv07 akashsriv07 force-pushed the addDataprocGDCKrmOperators branch 3 times, most recently from 845ef92 to 3ba983e on March 11, 2024 08:28
@eladkal
Contributor

eladkal commented Mar 13, 2024

> Hey, Michal from Cloud Composer team here.
> +1 for this code living in a separate GDC provider - GDC is based on slightly different infra than GCP, and the operators won't necessarily work the same way (some might be very similar, some very different).

Given the above statement and the fact that we are going to separate Google into several providers, we should treat this PR like adding a new provider, which means it needs to go through the approval cycle like any other new provider:
https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers

> If there's shared code, I propose having a util module/provider between these two.

That will not work, and this is one of the reasons why separating the Google provider is not an easy task. You can't have a util shared between two providers; a util belongs to one provider. Maybe Google can introduce a google.common provider, but that would require a separate discussion.
I really advise against adding more to the Google provider; it will make things harder for the decoupling project.

Contributor

@eladkal eladkal left a comment


Setting request changes to avoid accidental merge.


This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 30, 2024
@akashsriv07
Contributor Author

> Setting request changes to avoid accidental merge.

Will explore the feasibility of creating the provider and the process to be followed. Thanks!

@eladkal eladkal removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 5, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 31, 2024
@github-actions github-actions bot closed this Sep 9, 2024