[AIRFLOW-5945] Make inbuilt OperatorLinks work when using Serialization #6715

Merged
merged 9 commits into from
Dec 6, 2019

Conversation

kaxil
Member

@kaxil kaxil commented Dec 2, 2019

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    The inbuilt OperatorLinks did not work when DAG Serialization was enabled (this was one of the known limitations of DAG Serialization). The previous workaround was to register those OperatorLinks through an Airflow plugin.

This PR fixes that, so the inbuilt links work out of the box.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Added

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to an appropriate release

@kaxil kaxil requested a review from ashb December 2, 2019 22:07
@mik-laj
Member

mik-laj commented Dec 2, 2019

What do you think about saving the generated links in a separate column in the database after task execution? I think that would be nicer, because we would store not additional partial data but the final data we are interested in. This could be implemented independently of DAG serialization. In addition, it would simplify the logic for handling these links. Currently it is not possible to fetch all links for a given task at once; we fetch one link per HTTP request. https://github.com/apache/airflow/blob/master/airflow/www/templates/airflow/dag.html#L492-L518

@kaxil
Member Author

kaxil commented Dec 2, 2019

What do you think about saving the generated links in a separate column in the database after task execution? I think that would be nicer, because we would store not additional partial data but the final data we are interested in. This could be implemented independently of DAG serialization. In addition, it would simplify the logic for handling these links. Currently it is not possible to fetch all links for a given task at once; we fetch one link per HTTP request. https://github.com/apache/airflow/blob/master/airflow/www/templates/airflow/dag.html#L492-L518

This is actually the first solution I mentioned to Ash, and I thought it was the way to go :) Unfortunately, Ash mentioned a use case that ruled it out, which I think was "an S3 signed URL can have an expiration time of 15 minutes and change after that". @ashb can probably explain it in more detail, but because of that use case, storing the links in the DB wouldn't work.

@codecov-io

codecov-io commented Dec 3, 2019

Codecov Report

Merging #6715 into master will decrease coverage by 0.28%.
The diff coverage is 91.93%.

@@            Coverage Diff             @@
##           master    #6715      +/-   ##
==========================================
- Coverage   84.83%   84.54%   -0.29%     
==========================================
  Files         669      669              
  Lines       37738    37834      +96     
==========================================
- Hits        32014    31988      -26     
- Misses       5724     5846     +122
Impacted Files                                  Coverage Δ
airflow/models/baseoperator.py                  96.07% <100%> (+0.01%) ⬆️
airflow/models/dag.py                           90.95% <100%> (ø) ⬆️
airflow/gcp/operators/bigquery.py               84.39% <100%> (+0.2%) ⬆️
airflow/contrib/operators/qubole_operator.py    88.57% <100%> (+0.87%) ⬆️
airflow/plugins_manager.py                      87.74% <77.77%> (-1.47%) ⬇️
airflow/serialization/serialized_objects.py     91.43% <90.9%> (-0.17%) ⬇️
airflow/utils/tests.py                          85.89% <92.3%> (+4.5%) ⬆️
airflow/kubernetes/volume_mount.py              44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py                    52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py              45.25% <0%> (-46.72%) ⬇️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ce873af...3596a42.

@ashb
Member

ashb commented Dec 3, 2019

Yeah, we did think about that as a solution (which means we should probably add a comment/"architecture decision record" somewhere), but the issue is that I know of a link I want to add to EMR that isn't fully static: linking from EmrAddStepOperator to the job logs in S3, which involves signing a URL that will only be valid for 15 minutes. (This uses the Airflow server's AWS credentials, rather than needing the user to have the right permissions, which doesn't really work with S3 easily.)

Because of this, only allowing static links in the DB would not work for this use case.
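
To make the expiry problem concrete, here is a minimal sketch using only the standard library. It is not the S3 signing algorithm: `sign_url`, `is_valid`, the secret, and the example URL are all hypothetical stand-ins for a presigned URL with a 15-minute lifetime.

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical; stands in for the server's AWS credentials
TTL_SECONDS = 15 * 60           # 15-minute validity, as in the use case above


def sign_url(path: str, now: int) -> str:
    """Return a URL carrying an expiry timestamp and an HMAC over path + expiry."""
    expires = now + TTL_SECONDS
    msg = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"


def is_valid(url: str, now: int) -> bool:
    """Check the signature and that the expiry has not passed."""
    base, _, sig = url.rpartition("&sig=")
    expires = int(base.rpartition("expires=")[2])
    expected = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and now <= expires


# A link generated when the task finished (t=0) and stored in the database...
stored = sign_url("https://example.invalid/logs/step-1", now=0)
# ...still works one minute later, but not one hour later.
valid_soon = is_valid(stored, now=60)
valid_later = is_valid(stored, now=3600)
# A link generated at request time is valid when the user clicks it.
fresh_ok = is_valid(sign_url("https://example.invalid/logs/step-1", now=3600), now=3600)
```

A stored link therefore only helps for the first 15 minutes after the task ran, which is why the link has to be computed when the page is requested.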

# If OperatorLinks is defined in Plugins but not in the Operator
# set the Operator links attribute
if op_extra_links_from_plugin and "_operator_links_sources" not in encoded_op:
    setattr(op, "operator_extra_links", list(op_extra_links_from_plugin.values()))
Member

What if we define Links in a plugin and in the operator?

I'm finding this code (including the existing code) a bit hard to follow. Hmmmm.

Member Author

If links are defined both in a plugin and in the operator, then the following code gets executed:
https://github.com/apache/airflow/pull/6715/files#diff-f58748b66b2d3d00c8132103faea223fR114-R121

I can add more detailed comments if it is not clear.
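
The two cases discussed in this thread can be sketched as a single function. This is only an illustration: `resolve_extra_links` is a hypothetical helper, plain strings stand in for link objects, and the rule that a plugin link overrides a same-named predefined link is an assumption, since the linked diff is not reproduced here.

```python
def resolve_extra_links(encoded_op: dict, links_from_plugin: dict, predefined_links: dict) -> list:
    """Return the link objects to attach to a deserialized operator."""
    if "_operator_links_sources" not in encoded_op:
        # Only plugins define links for this operator (the branch quoted above).
        return list(links_from_plugin.values())
    # Both sources define links. ASSUMPTION: plugin links win on a name clash.
    merged = dict(predefined_links)
    merged.update(links_from_plugin)
    return list(merged.values())


# Links keyed by name; the values are placeholder strings, not real link classes.
only_plugin = resolve_extra_links({}, {"Docs": "plugin-docs"}, {})
both = resolve_extra_links(
    {"_operator_links_sources": []},
    {"Console": "plugin-console"},
    {"Console": "builtin-console", "Logs": "builtin-logs"},
)
```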

for _operator_links_source in encoded_op_links:
    _operator_link_class, data = list(_operator_links_source.items())[0]
    try:
        single_link = import_string(_operator_link_class)
Member

I'm not sure if we should do this here (the import string) - my first thought was that we should look through the loaded plugins only.

Member Author

It is needed for the inbuilt OperatorLinks (BigQuery and Qubole) to work
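
For context, `import_string` turns a dotted path stored in the serialized DAG back into a class object. A minimal stand-in built on `importlib` is sketched below. It is not Airflow's own implementation, and the demo path is a standard-library class chosen only so the snippet runs without Airflow installed.

```python
from importlib import import_module


def import_string(dotted_path: str):
    """Import a dotted path and return the attribute named by its last component."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ImportError(f"{dotted_path!r} doesn't look like a module path")
    module = import_module(module_path)
    try:
        return getattr(module, attr_name)
    except AttributeError as err:
        raise ImportError(f"Module {module_path!r} has no attribute {attr_name!r}") from err


# Demonstrated with a standard-library class rather than an Airflow link class.
ordered_dict_cls = import_string("collections.OrderedDict")
```

Because such a function imports whatever path the serialized data names, restricting the lookup to already-loaded plugins, as suggested above, is the more conservative option. The inbuilt links are not registered as plugins, hence the need for the import here.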

Member

@ashb ashb left a comment

Somewhere I'd like to see us do a "static" deserialization that includes Links, both ones defined in the operator and ones registered via a plugin, like we have with serialized_simple_dag_ground_truth.

(it might be time to move those to JSON files rather than having them in the test file? Don't mind either way)

@kaxil
Member Author

kaxil commented Dec 3, 2019

Somewhere I'd like to see us do a "static" deserialization that includes Links, both ones defined in the operator and ones registered via a plugin, like we have with serialized_simple_dag_ground_truth.

(it might be time to move those to JSON files rather than having them in the test file? Don't mind either way)

Added

@kaxil kaxil requested a review from ashb December 3, 2019 20:53
@kaxil kaxil force-pushed the AIRFLOW-5945-predefined-op-link branch from 2ebd096 to f6bb1a7 Compare December 3, 2019 21:10
@kaxil kaxil requested a review from potiuk December 3, 2019 21:10
@kaxil kaxil force-pushed the AIRFLOW-5945-predefined-op-link branch 2 times, most recently from f1664d2 to 905ced1 Compare December 3, 2019 23:36
@kaxil kaxil force-pushed the AIRFLOW-5945-predefined-op-link branch 2 times, most recently from 68ecb72 to 6d98614 Compare December 5, 2019 18:25
@kaxil kaxil force-pushed the AIRFLOW-5945-predefined-op-link branch from 6d98614 to 3c6ad8e Compare December 5, 2019 18:27
op_predefined_extra_links = {}

for _operator_links_source in encoded_op_links:
    _operator_link_class, data = list(_operator_links_source.items())[0]
Member

We should probably add this to our JSON Schema. Something like this is the part of the schema we want for op_links:

{ "type": "array",
  "items": {
    "type": "object",
    "minProperties": 1,
    "maxProperties": 1
  }
}
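
The constraints in that schema fragment (an array whose items are objects with exactly one property) can be mirrored by a small hand-written check. This is a sketch only: `validate_op_links` is a hypothetical helper, not part of Airflow or of any JSON Schema library, and the class path in the example is made up.

```python
def validate_op_links(value) -> bool:
    """Mirror the schema fragment: an array of objects with exactly one property each."""
    if not isinstance(value, list):
        return False
    return all(isinstance(item, dict) and len(item) == 1 for item in value)


# One entry per link: {"<dotted class path>": {<data for that link>}}
ok = validate_op_links([{"my_pkg.links.ConsoleLink": {"project": "demo"}}])
too_many_keys = validate_op_links([{"a.B": {}, "c.D": {}}])
empty_object = validate_op_links([{}])
not_an_array = validate_op_links({"a.B": {}})
```

The one-property rule is what makes it safe to unpack each item as a single (class path, data) pair in the loop under review.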

Member

@ashb ashb Dec 6, 2019

This line took me a while to work out. _operator_link_class, data = _operator_links_source.popitem() modifies the dict in place, but it works and is a bit clearer to me. What do you think?
(I don't know if that works on py2, which isn't a problem here but will be when backporting this.)
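
The difference between the two spellings can be seen on a single-entry dict. The class path and data below are made-up sample values.

```python
source_a = {"my_pkg.links.ConsoleLink": {"project": "demo"}}
source_b = dict(source_a)

# Original form: builds a list of pairs and takes the first; the dict is untouched.
cls_a, data_a = list(source_a.items())[0]

# Suggested form: removes and returns a pair; the dict is left empty.
cls_b, data_b = source_b.popitem()
```

Both forms yield the same pair here because each dict holds exactly one entry. `dict.popitem()` is also available on Python 2 dicts, so the suggestion should not block a backport.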

Member Author

Done

@kaxil kaxil merged commit 803a87f into apache:master Dec 6, 2019
@kaxil kaxil deleted the AIRFLOW-5945-predefined-op-link branch December 6, 2019 23:07
kaxil added a commit that referenced this pull request Dec 18, 2019
ashb pushed a commit that referenced this pull request Dec 19, 2019
kaxil added a commit that referenced this pull request Dec 19, 2019
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020