Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ADLSCreateObjectOperator #37821

Merged
merged 48 commits into from
Mar 15, 2024
Merged

Conversation

dabla
Copy link
Contributor

@dabla dabla commented Mar 1, 2024

Hello, as discussed in the previous PR I created this new pull request for introducing the DataToADLSOperator. This operator allows you to use the AzureDataLakeStorageV2Hook without the need to write custom python code to interact with the Hook.

There is already a LocalFilesystemToADLSOperator which uses the older AzureDataLakeHook, but that operator doesn't allow you to directly store an XCom to Fabric, you need to create a local file first before uploading it, which in this case would mean creating an additional task to do it while in fact it's not necessary as the AzureDataLakeStorageV2Hook allows you to achieve the same without that intermediate step.

That's why I introduced this DataToADLSOperator, we already use it at our company as a custom operator and it works great as this easily allows you to upload an XCom to Fabric, without the need to write python code interacting with the AzureDataLakeStorageV2Hook. This means a more consice DAG, less code which improves readability and reduces complexity and of course makes it easy to use.

Here is a simplified example on how to use the operator:

from airflow import DAG
from airflow.operators.jira_plugin import JiraOperator
from airflow.providers.microsoft.azure.transfers.data_to_local import DataToADLSOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),
}

with DAG('jira_example_dag', default_args=default_args, schedule_interval=None) as dag:
    create_issue_task = JiraOperator(
        task_id='create_issue',
        jira_conn_id='jira_connection',
        method='POST',
        endpoint='issue',
        json={
            'fields': {
                'project': {'key': 'PROJ'},
                'summary': 'This is a test issue created from Airflow',
                'description': 'This is the description of the test issue.',
                'issuetype': {'name': 'Task'},
            }
        },
	dag=dag,
    )

    get_issue_task = JiraOperator(
        task_id='get_issue',
        jira_conn_id='jira_connection',
        method='GET',
        endpoint='issue/AIRFLOW-123',
	dag=dag,
    )

    write_to_azure_task = DataToADLSOperator(
        task_id="write_to_azure",
	azure_data_lake_conn_id="ms_azure_data_lake_connection",
	file_system_name="JIRA Monitoring",
	file_name="Files/JIRA/issues/Airflow/AIRFLOW-123.json",
	data=get_issue_task.output,
	overwrite=True,
	dag=dag,
    )

    create_issue_task >> get_issue_task >> write_to_azure_task

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@eladkal
Copy link
Contributor

eladkal commented Mar 7, 2024

Ok @eladkal it's because the branch was based on the other PR with proxy fix where I had to the new operator, so it will be hard to remove that file. What do you want me to do, create a new PR for this one based on main?

It's just 3 lines. It would be simpler to just re-add these lines to the file (they exist in main but not in your branch) then just commit it. Once you do that the diff will be no changes to the file.

Feel free to send me message on Slack if you need help

@dabla
Copy link
Contributor Author

dabla commented Mar 7, 2024

Ok @eladkal it's because the branch was based on the other PR with proxy fix where I had to the new operator, so it will be hard to remove that file. What do you want me to do, create a new PR for this one based on main?

It's just 3 lines. It would be simpler to just re-add these lines to the file (they exist in main but not in your branch) then just commit it. Once you do that the diff will be no changes to the file.

Feel free to send me message on Slack if you need help

Hello Edal, first of all thank you for your patience now I fully understand what happend. This file is indeed not the same as the main branch (I forgot to sync main on my repo hence why I didn't see the latestes updates and diff). But it should be changed, because it contains remains of the test of that new operator I had to remove back then if you remember, but on that PR I apparently forgot to remove those unused constants. Those constants are not used in that test anymore, so I would keep the changes as they clean up the test, but if you prefer to keep it then I will re-add those 3 lines if you want. Let me know what you I have to do.

@eladkal
Copy link
Contributor

eladkal commented Mar 8, 2024

OK now i get it.
the claim is that the changes were added in #37103 by mistake :)
This is why separating unrelated changes in PRs is so important it. It would have been much easier to understand and review if a separated PR would raise just to revert the unneeded changes (no need to do that now)

So all left here is just to fix the doc build.

@dabla
Copy link
Contributor Author

dabla commented Mar 8, 2024

OK now i get it. the claim is that the changes were added in #37103 by mistake :) This is why separating unrelated changes in PRs is so important it. It would have been much easier to understand and review if a separated PR would raise just to revert the unneeded changes (no need to do that now)

So all left here is just to fix the doc build.

Exactly, it was my mistake as I didn't clean up the test well in my previous pull request hence all the confussion, so apologies once again. Yeah the doc thing is a mysterie to me imho, some help or some pointers in a direction could help here :)

@eladkal
Copy link
Contributor

eladkal commented Mar 8, 2024

Exactly, it was my mistake as I didn't clean up the test well in my previous pull request hence all the confussion, so apologies once again. Yeah the doc thing is a mysterie to me imho, some help or some pointers in a direction could help here :)

hard to tell.
Can you extract the adsl list docs to a separated PR? (again, it's not part of the ADLSCreateObjectOperator)
That way we can check it separately and see which one is causing the doc build to fail

@dabla
Copy link
Contributor Author

dabla commented Mar 11, 2024

Exactly, it was my mistake as I didn't clean up the test well in my previous pull request hence all the confussion, so apologies once again. Yeah the doc thing is a mysterie to me imho, some help or some pointers in a direction could help here :)

hard to tell. Can you extract the adsl list docs to a separated PR? (again, it's not part of the ADLSCreateObjectOperator) That way we can check it separately and see which one is causing the doc build to fail

I reverted that change but the error stays the same

@dabla
Copy link
Contributor Author

dabla commented Mar 13, 2024

@eladkal I finally managed to make the build pass, I kept the example for the list operator as I found what was causing the issue ( a missing single quote :))

@dabla dabla requested a review from eladkal March 14, 2024 20:42
@eladkal eladkal merged commit 60b95c7 into apache:main Mar 15, 2024
52 checks passed
@dabla dabla deleted the feature/data-to-adls-operator branch March 15, 2024 12:02
@dabla
Copy link
Contributor Author

dabla commented Mar 15, 2024

Thanks @eladkal :)

utkarsharma2 pushed a commit to astronomer/airflow that referenced this pull request Apr 22, 2024
* feat: Pass proxies config when using ClientSecretCredential in AzureDataLakeStorageV2Hook and added DataToADLSOperator which allows uploading data (e.g. from an XCOM) to a remote file without the need to have a local file created first
---------

Co-authored-by: David Blain <[email protected]>
Co-authored-by: David Blain <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants