
[R] Possible regression in dev arrow #43627

Closed
thisisnic opened this issue Aug 9, 2024 · 6 comments

@thisisnic

Describe the bug, including details regarding any error messages, version, and platform.

I just ran this with dev arrow and it took 45 seconds on my first run and 85 seconds on my second run:

```r
library(arrow)
library(dplyr)
library(tictoc)

nyc_taxi <- open_dataset("data/nyc-taxi/")

tic()
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  collect()
toc()
```

With 16.1.0, the same query took only 5 seconds on both runs.

Component(s)

R

@nealrichardson

Could you do show_query() on both versions?
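For reference, `show_query()` prints the Acero ExecPlan without executing the query, so it can be appended in place of `collect()`. A sketch against the pipeline from the original report (the `data/nyc-taxi/` path is the reporter's local dataset):

```r
library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("data/nyc-taxi/")

# Print the ExecPlan instead of evaluating it
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  show_query()
```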

@thisisnic

thisisnic commented Aug 9, 2024

Here's the one from 16.1.0:

```
ExecPlan with 5 nodes:
4:SinkNode{}
  3:ProjectNode{projection=[year, all_trips, shared_trips, "pct_shared": multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 1)]}
    2:GroupByNode{keys=["year"], aggregates=[
    	hash_count_all(*),
    	hash_sum(shared_trips, {skip_nulls=true, min_count=0}),
    ]}
      1:ProjectNode{projection=["shared_trips": (passenger_count > 1), year]}
        0:SourceNode{}
```

Will have to come back to the other one during/after posit::conf if nobody else gets there first as I don't have time now.

@nealrichardson

This is on dev. At least the part I was looking at (the 1:ProjectNode) is the same:

```
ExecPlan with 5 nodes:
4:SinkNode{}
  3:ProjectNode{projection=[year, all_trips, shared_trips, "pct_shared": multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 1)]}
    2:GroupByNode{keys=["year"], aggregates=[
    	hash_count_all(*),
    	hash_sum(shared_trips, {skip_nulls=true, min_count=0}),
    ]}
      1:ProjectNode{projection=["shared_trips": (passenger_count > 1), year]}
        0:SourceNode{}
```

Will have to do some profiling to see what's happening. The obvious suspect is #41576, but I don't have an intuition as to how it's responsible.

@nealrichardson

I think there's something off in the C++ library that's responsible for this regression. I ran profvis on this query on the CRAN 16.1.0 version and a dev build, and the hotspot is RecordBatchReader$read_table called inside collect(), i.e. Acero is slower in evaluating the query and returning the batches. It does not seem to be anything in the R layer.

I'll try bisecting history to find out which commit introduces this.
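For anyone reproducing the profiling step, a minimal profvis sketch (the `data/nyc-taxi/` path is assumed from the original report; the hotspot appears under `RecordBatchReader$read_table` inside `collect()`):

```r
library(arrow)
library(dplyr)
library(profvis)

nyc_taxi <- open_dataset("data/nyc-taxi/")

# Profile the full query evaluation, including collect()
profvis({
  nyc_taxi |>
    group_by(year) |>
    summarise(
      all_trips = n(),
      shared_trips = sum(passenger_count > 1, na.rm = TRUE)
    ) |>
    collect()
})
```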

@nealrichardson

Nope, it is #41576 somehow. That's the commit where this query gets slow. I will try to understand better and get a fix.

@nealrichardson

I think I've figured it out. The query plans look the same, but they're not: the 0:SourceNode doesn't print the options it holds, and that conceals the difference. (I thought I wrote an issue years ago about improving those print methods, but I can't find it now.) The old code path did a dplyr::select() to subset the columns before the aggregation, and that projection got pushed down into the SourceNode. The 1:ProjectNode projection, which selects the columns for the 2:GroupByNode aggregation, does not get pushed down. So without that select(), we read a bunch of data from the Parquet files into memory only to immediately throw it away.

I'll make a PR to add back that select(), and I'll fix the comment from before that was misleading and led me to believe that it wasn't doing anything of value.
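Until that fix lands, a user-side workaround consistent with this diagnosis is to subset columns explicitly before aggregating, so the projection is pushed down into the Parquet scan. A sketch, assuming the dataset from the original report:

```r
library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("data/nyc-taxi/")

nyc_taxi |>
  # Explicit column subset: this projection is pushed into the SourceNode,
  # so only these two columns are read from the Parquet files
  select(year, passenger_count) |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  collect()
```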

nealrichardson added a commit that referenced this issue Aug 14, 2024
### Rationale for this change

See #43627 (comment)

### What changes are included in this PR?

An extra `dplyr::select()`

### Are these changes tested?

Conbench should show that the performance is much better

### Are there any user-facing changes?

Not slow
* GitHub Issue: #43627
nealrichardson added this to the 18.0.0 milestone Aug 14, 2024
jonkeane pushed a commit that referenced this issue Aug 14, 2024