-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet: Add finer metrics on operations covered by time_elapsed_opening
#12585
Conversation
As it does not measure evaluation of row group-level pushdown filters
time_elapsed_opening
time_elapsed_opening
ea09cf7
to
b9b7156
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @progval
I have some small suggestions but I don't think they are required to merge this PR
@@ -43,14 +43,20 @@ pub struct ParquetFileMetrics { | |||
pub pushdown_rows_pruned: Count, | |||
/// Total rows passed predicates pushed into parquet scan | |||
pub pushdown_rows_matched: Count, | |||
/// Total time spent evaluating pushdown filters | |||
pub pushdown_eval_time: Time, | |||
/// Total time spent evaluating row-level pushdown filters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside while I am reviewing this code, the description of what the metrics mean would be more useful if we could attach them to the Metric itself somehow 🤔
Something like:
let row_pushdown_eval_time = MetricBuilder::new(metrics)
.with_description("Total time spent evaluating row-level pushdown filters")
.with_new_label("filename", filename.to_string())
.subset_time("row_pushdown_eval_time", partition);
Though then we would have to figure out how to expose that information in an explain plan 🤔
|
||
if enable_bloom_filter && !row_groups.is_empty() { | ||
let mut timer = file_metrics.bloom_filter_eval_time.timer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I note that the time calculation for page_index_eval_time
below seems to be in the call to
access_plan = p.prune_plan_with_page_index(
access_plan,
&file_schema,
builder.parquet_schema(),
file_metadata.as_ref(),
&file_metrics,
);
I wonder if we should move the timer into this function for consistency (or alternatively push the timer calculation into prune_by_bloom_filters
and prune_by_statistics
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I did the latter, because it's probably closer to what users expect.
row_groups.prune_by_statistics( | ||
&file_schema, | ||
builder.parquet_schema(), | ||
rg_metadata, | ||
predicate, | ||
&file_metrics, | ||
); | ||
timer.stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is a problem, but technically speaking the stop()
is unecessary (it happens automatically when timer
goes out of scope and is Drop
ed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is needed, because it only goes out of scope after the Bloom Filter evaluation. The next timer.stop()
is indeed redundant, but I kept it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't matter anymore, as I moved both calls to the callees.
Thanks again @progval |
…ning` (apache#12585) * Rename 'pushdown_eval_time' to 'row_pushdown_eval_time' As it does not measure evaluation of row group-level pushdown filters * Add statistics_eval_time and bloom_filter_eval_time metrics * Add metadata_load_time metric * Add docs * Move timers to the callees
Which issue does this PR close?
Closes #12584.
Rationale for this change
This allows both Datafusion developers and users to better measure the impact of their optimizations. For example, I am working on #12547 (caching the page index), and my patch makes these metrics go from:
to
which shows it reduced
metadata_load_time
andtime_elapsed_opening
with little impact to the others.What changes are included in this PR?
metadata_load_time
statistics_eval_time
andbloom_filter_eval_time
pushdown_eval_time
torow_pushdown_eval_time
, because statistics and eval time are pushdown filters tooAre these changes tested?
Not really. There is no unit tests, but it works in datafusion-cli and as a library
Are there any user-facing changes?
Renamed a metric, and added two others, visible in
EXPLAIN ANALYZE
.