[AIRFLOW-774] Fix long-broken DAG parsing Statsd metrics #6157
@@ -389,24 +388,19 @@ def collect_dags(
 td = td.total_seconds() + (
 float(td.microseconds) / 1000000)
 stats.append(FileLoadStat(
-filepath.replace(dag_folder, ''),
+filepath.replace(settings.DAGS_FOLDER, ''),
@milton0825 This was causing the parse time metric to be emitted as `dag.loading-duration.` (with an empty file name part). Whoops :D
@ashb could you elaborate on why the DAG parsing time metric is incorrect (Lyft has been using it for a while)? I can understand why the DagBag size metric is wrong, but my understanding is that even if we parse the DagBag with multiple processes, each DAG will only get parsed by one process.
@feng-tao The same metric is emitted in multiple places. If you look you will find both (one with the filename and one without). Maybe statsd ignores that one (I was looking at the metrics emitted by running netcat). But to give more detail: take a dag_folder of /opt/airflow/dags and a file /opt/airflow/dags/dag1.py
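The comment above is cut off, so the exact explanation is not available here. As a rough sketch only, the snippet below shows how the `filepath.replace(dag_folder, '')` expression from the diff behaves for the paths mentioned. The helper name `metric_suffix` is hypothetical, and the second case assumes (this thread does not confirm it) that a parsing sub-process is handed the single file as its "folder".

```python
def metric_suffix(filepath: str, dag_folder: str) -> str:
    """Hypothetical helper mirroring the expression in the diff above."""
    return filepath.replace(dag_folder, '')


# dag_folder is the directory: the file part of the path survives.
with_folder = metric_suffix('/opt/airflow/dags/dag1.py', '/opt/airflow/dags')
print(repr(with_folder))  # '/dag1.py'

# ASSUMPTION: the "folder" is the file itself, so nothing is left over
# and a metric built from the result would end in a bare dot.
with_file = metric_suffix('/opt/airflow/dags/dag1.py',
                          '/opt/airflow/dags/dag1.py')
print(repr(with_file))  # ''
```

If that assumption holds, it would match the empty "dag_file" part described in the PR description.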
@ashb Ah I see, thanks for the info. Yeah, internally we still emit the metric from the webserver side instead of the scheduler side, hence we didn't observe this issue. And I wonder why we are renaming the metrics?
- `dag_file_processor_timeouts` -- use `dag_processing.processor_timeouts` instead
- `collect_dags` -- use `dag_processing.total_parse_time` instead
- `dag.loading-duration.<basename>` -- use `dag_processing.last_duration.<basename>` instead
- `dag_processing.last_runtime.<basename>` -- use `dag_processing.last_duration.<basename>` instead
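For anyone updating dashboards or alerts, the four renames visible in this hunk can be sketched as a small lookup. This is only an illustration: `new_metric_name` and the two dictionaries are hypothetical names, not part of Airflow, and the hunk may not show the complete list of deprecated metrics.

```python
from typing import Optional

# Deprecated name -> replacement, transcribed from the list above.
EXACT_RENAMES = {
    "dag_file_processor_timeouts": "dag_processing.processor_timeouts",
    "collect_dags": "dag_processing.total_parse_time",
}

# Metrics that carry a per-file "<basename>" suffix are matched by prefix.
PREFIX_RENAMES = {
    "dag.loading-duration.": "dag_processing.last_duration.",
    "dag_processing.last_runtime.": "dag_processing.last_duration.",
}


def new_metric_name(old: str) -> Optional[str]:
    """Return the replacement for a deprecated metric, or None if not listed."""
    if old in EXACT_RENAMES:
        return EXACT_RENAMES[old]
    for old_prefix, new_prefix in PREFIX_RENAMES.items():
        if old.startswith(old_prefix):
            # Keep the per-file suffix, swap only the prefix.
            return new_prefix + old[len(old_prefix):]
    return None


print(new_metric_name("dag.loading-duration.dag1"))
```

Note that both per-file metrics collapse onto the same new name, which is the point raised in the review comment that follows.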
Typo?
Both L50 and L51 map to the same name now. Could we keep loading-duration (or loading_duration) and last_runtime, and only change the prefix from dag to dag_processing? Or is there anything I missed here?
These metrics were almost identical.
`loading-duration`:
airflow/airflow/models/dagbag.py
Lines 411 to 413 in 23ec78a
Stats.timing('dag.loading-duration.{}'.
             format(filename),
             file_stat.duration)
airflow/airflow/models/dagbag.py
Lines 388 to 393 in 23ec78a
td = timezone.utcnow() - ts
td = td.total_seconds() + (
    float(td.microseconds) / 1000000)
stats.append(FileLoadStat(
    filepath.replace(dag_folder, ''),
    td,
So `loading-duration` is the time taken for the process_file() call.
`last_runtime`:
airflow/airflow/utils/dag_processing.py
Lines 950 to 953 in f497d1d
Stats.gauge(
    'dag_processing.last_runtime.{}'.format(file_name),
    last_runtime
)
airflow/airflow/utils/dag_processing.py
Lines 1113 to 1114 in f497d1d
self._last_runtime[file_path] = (now -
                                 processor.start_time).total_seconds()
`last_runtime` was the time taken by the subprocess that calls process_file(), so it would always be strictly greater than `loading-duration`.
Having two metrics that were almost identical but recorded slightly different things seemed more confusing than not to me.
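To illustrate why the outer measurement always comes out larger, here is a minimal sketch that times an inner call and the worker wrapped around it. It uses plain functions and `time.sleep` rather than real sub-processes, and `fake_process_file` and `run_worker` are hypothetical stand-ins, not Airflow code.

```python
import time
from typing import Tuple


def fake_process_file() -> float:
    """Hypothetical stand-in for process_file(); returns its own duration."""
    start = time.perf_counter()
    time.sleep(0.01)  # pretend to parse a DAG file
    return time.perf_counter() - start


def run_worker() -> Tuple[float, float]:
    """Simulate a worker whose lifetime wraps the parsing call."""
    start = time.perf_counter()
    time.sleep(0.005)            # simulated start-up cost of the worker
    inner = fake_process_file()  # analogous to the inner duration
    time.sleep(0.005)            # simulated teardown / result hand-off
    outer = time.perf_counter() - start  # analogous to the worker runtime
    return inner, outer


inner, outer = run_worker()
print(f"inner={inner:.3f}s outer={outer:.3f}s")
```

The outer interval fully contains the inner one, so the two numbers track each other closely while never agreeing exactly.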
I C
And I haven't kept up with the latest progress, but is the stateless webserver code in 1.10.5?
Stateless webserver is not in 1.10.5 but will be an option for 1.10.6 (or more likely .7)
Can we also update https://github.com/apache/airflow/blob/master/docs/metrics.rst please?
D'oh yes. I had already made the change but lost it when rebasing onto latest master. Added back.
@feng-tao Mostly for consistency: dash vs underscore. With this change all the metrics related to timing of dag processing/importing are now under
Since we switched to using sub-processes to parse the DAG files sometime back in 2016(!) the metrics we have been emitting about dag bag size and parsing have been incorrect. We have also been emitting metrics from the webserver, which is going to become wrong as we move towards a stateless webserver. To fix both of these issues I have stopped emitting the metrics from models.DagBag and only emit them from inside the DagFileProcessorManager. (There was also a bug in the `dag.loading-duration.*` metric we were emitting from the DagBag code, where the "dag_file" part of that metric was empty. I have fixed that even though I have now deprecated that metric.)
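As a rough illustration of why a size gauge emitted from per-file sub-processes can mislead, the toy model below treats a gauge as "last value wins", which is how a plain statsd gauge behaves. The metric names, the `gauge` helper, and the per-file DAG counts are all hypothetical. It assumes each sub-process only sees the file it parsed, while a manager sees every result.

```python
from typing import Dict

gauges: Dict[str, float] = {}


def gauge(name: str, value: float) -> None:
    """Toy gauge: the most recent value overwrites the previous one."""
    gauges[name] = value


# Hypothetical parse results: number of DAGs found in each file.
dags_per_file = {"dag1.py": 2, "dag2.py": 3, "dag3.py": 1}

# Per-process emission: every worker reports only what it saw itself,
# so the gauge ends up holding whichever file was reported last.
for filename, count in dags_per_file.items():
    gauge("per_process.bag_size", count)

# Manager-side emission: one place sees all results and reports the total.
gauge("manager.bag_size", sum(dags_per_file.values()))

print(gauges)  # {'per_process.bag_size': 1, 'manager.bag_size': 6}
```

Emitting from a single coordinating component avoids the overwrite problem, which is in the spirit of moving emission into the DagFileProcessorManager.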
Codecov Report
@@ Coverage Diff @@
## master #6157 +/- ##
=========================================
Coverage ? 79.72%
=========================================
Files ? 608
Lines ? 35072
Branches ? 0
=========================================
Hits ? 27961
Misses ? 7111
Partials ? 0
Hooray, tests green! @feng-tao What do you think about the renamed metrics: worth updating them, or better to keep them as they are?
LGTM
Since we switched to using sub-processes to parse the DAG files sometime back in 2016(!) the metrics we have been emitting about dag bag size and parsing have been incorrect. We have also been emitting metrics from the webserver, which is going to become wrong as we move towards a stateless webserver. To fix both of these issues I have stopped emitting the metrics from models.DagBag and only emit them from inside the DagFileProcessorManager. (There was also a bug in the `dag.loading-duration.*` metric we were emitting from the DagBag code, where the "dag_file" part of that metric was empty. I have fixed that even though I have now deprecated that metric. The webserver was emitting the right metric though, so many people wouldn't notice.) (cherry picked from commit 5f9ab7a)
These were deprecated in 1.10.6 via apache#6157, so we should remove them before 2.0 rolls around.