Adding airflow operator to submit and monitor Spark Apps/App Env on DPGDC clusters #37223
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
I didn't find anything about this in the GCP documentation; could you please add the documentation link? Is the CRD dataprocgdc.cloud.google.com/v1alpha1 based on sparkoperator.k8s.io/v1beta2? I'm asking because we already have two operators for the spark-on-k8s-operator, so maybe we can use one of them as a superclass for your operator to avoid code duplication and implementing everything from scratch.
Hey Hussein, the CRD for DPGDC is completely different from sparkoperator.k8s.io/v1beta2, which stops us from leveraging the same operator/sensor.
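For context, here is a hedged sketch of what building a manifest for a dataprocgdc.cloud.google.com/v1alpha1 custom resource might look like. The real DPGDC schema is not shown in this thread, so every `spec` field name below is an illustrative assumption; the point is only that a CRD with a different group and shape cannot reuse operators written against the sparkoperator.k8s.io/v1beta2 schema.

```python
# Illustrative only: the actual dataprocgdc.cloud.google.com/v1alpha1 schema is
# not public in this thread, so the spec fields below are assumptions.

def build_dpgdc_spark_app(name: str, namespace: str, main_class: str,
                          jar_uri: str) -> dict:
    """Build a KRM manifest for a hypothetical DPGDC SparkApplication CR."""
    return {
        "apiVersion": "dataprocgdc.cloud.google.com/v1alpha1",  # DPGDC CRD group
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical field names; note they need not match the
            # sparkoperator.k8s.io/v1beta2 spec the existing operators target.
            "sparkApplicationConfig": {
                "mainClass": main_class,
                "jarFileUris": [jar_uri],
            },
        },
    }

manifest = build_dpgdc_spark_app(
    "pi-job", "default", "org.example.Pi", "file:///opt/spark/jobs/pi.jar"
)
print(manifest["apiVersion"])
```

An operator would then submit such a body through the Kubernetes custom-objects API; since the group, version, and spec differ from v1beta2, the existing spark-on-k8s operators' request-building and status-parsing logic would not carry over.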
It will be a bit complicated to review this PR without a doc, I will try to review the syntax and check if our conventions are respected. (cc: @eladkal could you take a look?)
No problem with that; my goal was to reduce code duplication if possible, but since they are different, we can keep building your operator from scratch.
Resolved review comments on tests/system/providers/google/cloud/dataprocgdc/example_dataprocgdc.py (outdated)
Hi @akashsriv07, I will propose a different approach for the folder/module structure. Since GDC might have different APIs or different inner working mechanisms, it could be good to separate the GDC-related things. Hence, I am suggesting creating a new gdc module inside airflow/providers/google/cloud. What do you think?
Hey @molcay, GDC is part of Google Cloud only. Shouldn't it be a part of it?
Hey @akashsriv07, when I read the following sentence on the product page, I thought that this product is not directly in GCP; rather, it is equivalent to GCP but for on-premise infrastructure:
But of course, I am not entirely sure on the topic. This suggestion falls into a gray area, I guess.

airflow/providers/google/cloud
├── example_dags
├── fs
├── gdc # <-- this is the new one
├── hooks
├── __init__.py
├── _internal_client
├── links
├── log
├── operators
├── __pycache__
├── secrets
├── sensors
├── transfers
├── triggers
└── utils

Also, maybe it is too early to have this module structure right now. We can also wait and see how it will extend.
Hey, Michal from Cloud Composer team here.
Given the above statement and the fact that we are going to separate Google into several providers, we should treat this PR like adding a new provider, which means it needs to go through the approval cycle like any other new provider.
That will not work, and this is one of the reasons why separating the Google provider is not an easy task. You can't have utils shared between two providers; a util belongs to one provider. Maybe Google can introduce a google.common provider, but that would require a separate discussion.
Setting request changes to avoid accidental merge
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Will explore the feasibility of creating the provider and the process to be followed. Thanks
Adding airflow operator to create and monitor Spark Apps/App Env on DPGDC clusters
GDC extends Google Cloud’s infrastructure and services to customer edge locations and data centers. We intend to bring Dataproc as a managed service offering to GDC.
As part of this, we need to integrate Airflow with Dataproc on GDC API resources to trigger Apache Spark workloads. The integration needs to cover two distinct API paths: local execution through the KRM API and remote execution through an equivalent One Platform API. We're targeting the former in this PR, i.e. an Airflow operator leveraging the KRM APIs.
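On the "monitor" half of the KRM path, a sensor typically polls the custom resource's reported status and maps it onto Airflow's poke/success/failure semantics. A minimal sketch of that mapping follows; the phase names (SUCCEEDED, FAILED, CANCELLED, PENDING) are assumptions for illustration, not the documented DPGDC states:

```python
# Sketch of status handling for a hypothetical DPGDC Spark app sensor.
# Phase names are assumptions; the real CRD may report different states.

TERMINAL_SUCCESS = {"SUCCEEDED"}
TERMINAL_FAILURE = {"FAILED", "CANCELLED"}


class SparkAppFailed(RuntimeError):
    """Raised when the custom resource reaches a failed terminal phase."""


def poke_status(cr_status: dict) -> bool:
    """Return True when the app finished successfully, False to keep poking."""
    phase = cr_status.get("phase", "PENDING")
    if phase in TERMINAL_SUCCESS:
        return True
    if phase in TERMINAL_FAILURE:
        raise SparkAppFailed(f"Spark app ended in phase {phase}")
    return False  # still PENDING/RUNNING: the sensor pokes again later
```

In a real sensor, `poke_status` would be fed the `status` stanza read back from the cluster each poke interval; the deferrable variant would do the same mapping inside a trigger.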
This PR contains two feature implementations: