
[AIRFLOW-774] Fix long-broken DAG parsing Statsd metrics #6157

Merged
merged 1 commit into apache:master on Sep 24, 2019

Conversation

ashb
Member

@ashb ashb commented Sep 20, 2019

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Since we switched to using sub-processes to parse the DAG files sometime
    back in 2016(!) the metrics we have been emitting about dag bag size and
    parsing have been incorrect.

    We have also been emitting metrics from the webserver, which is going to
    become wrong as we move towards a stateless webserver.

    To fix both of these issues I have stopped emitting the metrics from
    models.DagBag and only emit them from inside the
    DagFileProcessorManager.

    (There was also a bug in the dag.loading-duration.* we were emitting
    from the DagBag code where the "dag_file" part of that metric was empty.
    I have fixed that even though I have now deprecated that metric)

Tests

  • No, because these are a bit hard to test usefully, but I have looked at the metrics emitted.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to an appropriate release

@@ -389,24 +388,19 @@ def collect_dags(
             td = td.total_seconds() + (
                 float(td.microseconds) / 1000000)
             stats.append(FileLoadStat(
-                filepath.replace(dag_folder, ''),
+                filepath.replace(settings.DAGS_FOLDER, ''),
@ashb (Member Author) Sep 20, 2019

@milton0825 This was causing the parse time metric to be emitted as `dag.loading-duration.` (with an empty filename). Whoops :D

Member

.....

@ashb force-pushed the dagparsing-metrics branch 3 times, most recently from 047e99d to 314e450 on September 20, 2019 15:28
@feng-tao
Member

@ashb could you elaborate on why the DAG parsing time metric is incorrect (Lyft has been using it for a while)? I can understand why the DAG bag size metric is wrong, but my understanding is that even if we parse the dag bag with multiple processes, each DAG file only gets parsed by one process.

@ashb
Member Author

ashb commented Sep 20, 2019

@feng-tao The same metric is emitted in multiple places. If you look you will find both (one with the filename and one without). Maybe statsd ignores that one (I was looking at the metrics emitted via running netcat).

But to give it more detail, take a dag_folder of /opt/airflow/dags and a file /opt/airflow/dags/dag1.py:

  • When parsing a single DAG file, the scheduler creates a dag bag like this:
    dagbag = models.DagBag(file_path, include_examples=False)
    (where file_path is, for example, /opt/airflow/dags/dag1.py)
  • So dag_folder is in fact the path to a single file, not the whole dag folder as we might expect from the name.
  • When it comes to do filepath.replace(dag_folder, ''):
    • filepath is /opt/airflow/dags/dag1.py
    • dag_folder is ALSO /opt/airflow/dags/dag1.py
    • so the end result is ''
  • So the final metric that is emitted is 'airflow.dag.loading-duration.' (without a filename on the end). At least from this part of the scheduler the correct metric isn't emitted; another component that loads the "full" dag bag will probably give the right value. (See the sketch below.)
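A minimal sketch of that string handling (plain Python, not Airflow code; the paths are the illustrative ones from above):

```python
# Why the metric suffix ended up empty when DagBag was pointed at a single file.
dag_folder = "/opt/airflow/dags/dag1.py"   # here the "dag_folder" is really one file
filepath = "/opt/airflow/dags/dag1.py"

suffix = filepath.replace(dag_folder, "")
print(repr(suffix))  # '' -> the emitted metric becomes 'dag.loading-duration.'

# Stripping the real DAGs folder instead (what the fix does via settings.DAGS_FOLDER)
# keeps the file name in the metric:
dags_folder = "/opt/airflow/dags"
print(repr(filepath.replace(dags_folder, "")))  # '/dag1.py'
```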

@feng-tao
Member

@ashb Ah I see, thanks for the info. Yeah, internally we still emit the metric from the webserver side instead of the scheduler side, hence we didn't observe this issue. And I wonder why we renamed the metrics?

- `dag_file_processor_timeouts` -- use `dag_processing.processor_timeouts` instead
- `collect_dags` -- use `dag_processing.total_parse_time` instead
- `dag.loading-duration.<basename>` -- use `dag_processing.last_duration.<basename>` instead
- `dag_processing.last_runtime.<basename>` -- use `dag_processing.last_duration.<basename>` instead
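(For reference, the same rename mapping expressed as a small Python dict, e.g. for updating dashboards; illustrative only, with the names copied from the UPDATING.md entry above.)

```python
# Deprecated metric name -> replacement, per this PR's UPDATING.md entry.
DEPRECATED_METRICS = {
    "dag_file_processor_timeouts": "dag_processing.processor_timeouts",
    "collect_dags": "dag_processing.total_parse_time",
    "dag.loading-duration.<basename>": "dag_processing.last_duration.<basename>",
    "dag_processing.last_runtime.<basename>": "dag_processing.last_duration.<basename>",
}
```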
Member

typo?
both L50 and L51 are the same name now. I wonder, could we keep loading-duration or loading_duration and last_runtime, but just change the prefix from dag to dag_processing? Or is there anything I missed here...

Member Author

These metrics were almost identical.

loading-duration:

Stats.timing('dag.loading-duration.{}'.format(filename),
             file_stat.duration)

td = timezone.utcnow() - ts
td = td.total_seconds() + (
    float(td.microseconds) / 1000000)
stats.append(FileLoadStat(
    filepath.replace(dag_folder, ''),
    td,

So loading-duration is the time taken for the process_file() call.

last_runtime:

Stats.gauge(
    'dag_processing.last_runtime.{}'.format(file_name),
    last_runtime
)

self._last_runtime[file_path] = (now -
                                 processor.start_time).total_seconds()

last_runtime was the time taken by the subprocess that calls process_file(), so it would always be strictly greater than loading-duration.

Having two metrics that were almost identical but recorded slightly different things seemed more confusing than helpful to me.
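A rough sketch of the difference (plain Python, not Airflow code): one timer wraps only the process_file() call, the other wraps everything the subprocess does around it.

```python
import time

def process_file():
    time.sleep(0.1)  # stands in for parsing one DAG file

def subprocess_body():
    time.sleep(0.02)  # e.g. interpreter / module start-up inside the child process
    t0 = time.monotonic()
    process_file()
    loading_duration = time.monotonic() - t0  # roughly what dag.loading-duration.* measured
    time.sleep(0.02)  # e.g. sending results back to the manager
    return loading_duration

start = time.monotonic()
loading_duration = subprocess_body()
last_runtime = time.monotonic() - start      # roughly what dag_processing.last_runtime.* measured

print(f"loading-duration: {loading_duration:.3f}s")
print(f"last_runtime:     {last_runtime:.3f}s (always >= loading-duration)")
```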

Member

I C

@feng-tao
Member

And I haven't followed the latest progress, but is the stateless webserver code in 1.10.5?

@ashb
Member Author

ashb commented Sep 20, 2019

Stateless webserver is not in 1.10.5 but will be an option for 1.10.6 (or more likely .7)

@kaxil
Member

kaxil commented Sep 21, 2019

Can we also update https://github.com/apache/airflow/blob/master/docs/metrics.rst please?

@ashb
Member Author

ashb commented Sep 23, 2019

Can we also update https://github.com/apache/airflow/blob/master/docs/metrics.rst please?

D'oh yes. I had already made the change but lost it when rebasing my change to latest master. Added back.

@ashb
Member Author

ashb commented Sep 23, 2019

And I wonder why we renamed the metrics?

@feng-tao Mostly for consistency: dash vs underscore, dag. prefix vs dag_processing prefix.

With this change all the metrics related to the timing of dag processing/importing are now under the dag_processing. prefix (with the exception of dagbag_size, which I haven't changed. Maybe I should?)

@ashb force-pushed the dagparsing-metrics branch 4 times, most recently from 6ae3198 to 2428806 on September 23, 2019 11:46
Since we switched to using sub-processes to parse the DAG files sometime
back in 2016(!) the metrics we have been emitting about dag bag size and
parsing have been incorrect.

We have also been emitting metrics from the webserver, which is going to
become wrong as we move towards a stateless webserver.

To fix both of these issues I have stopped emitting the metrics from
models.DagBag and only emit them from inside the
DagFileProcessorManager.

(There was also a bug in the `dag.loading-duration.*` we were emitting
from the DagBag code where the "dag_file" part of that metric was empty.
I have fixed that even though I have now deprecated that metric)
@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@7506c95).
The diff coverage is 24.63%.


@@            Coverage Diff            @@
##             master    #6157   +/-   ##
=========================================
  Coverage          ?   79.72%           
=========================================
  Files             ?      608           
  Lines             ?    35072           
  Branches          ?        0           
=========================================
  Hits              ?    27961           
  Misses            ?     7111           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/models/dagbag.py 92.38% <ø> (ø)
airflow/jobs/scheduler_job.py 72.74% <0%> (ø)
airflow/utils/dag_processing.py 53.44% <25.75%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7506c95...3088eb9.

@ashb
Member Author

ashb commented Sep 23, 2019

Hooray, tests are green! @feng-tao What do you think about the renamed metrics - worth updating them, or better to keep them as they are?

@feng-tao
Member

LGTM

@ashb ashb merged commit 5f9ab7a into apache:master Sep 24, 2019
@ashb ashb deleted the dagparsing-metrics branch September 24, 2019 09:23
ashb added a commit to ashb/airflow that referenced this pull request Sep 24, 2019
Since we switched to using sub-processes to parse the DAG files sometime
back in 2016(!) the metrics we have been emitting about dag bag size and
parsing have been incorrect.

We have also been emitting metrics from the webserver, which is going to
become wrong as we move towards a stateless webserver.

To fix both of these issues I have stopped emitting the metrics from
models.DagBag and only emit them from inside the
DagFileProcessorManager.

(There was also a bug in the `dag.loading-duration.*` we were emitting
from the DagBag code where the "dag_file" part of that metric was empty.
I have fixed that even though I have now deprecated that metric. The
webserver was emitting the right metric though so many people wouldn't
notice)

(cherry picked from commit 5f9ab7a)
ashb added a commit to astronomer/airflow that referenced this pull request Nov 29, 2020
These were deprecated in 1.10.6 via apache#6157, so we should remove them
before 2.0 rolls around.
ashb added a commit that referenced this pull request Nov 30, 2020
These were deprecated in 1.10.6 via #6157, so we should remove them
before 2.0 rolls around.