
[R] Possible regression in dev arrow #43627

Closed
thisisnic opened this issue Aug 9, 2024 · 6 comments

@thisisnic

Describe the bug, including details regarding any error messages, version, and platform.

I just ran this with dev arrow and it took 45 seconds on my first run and 85 seconds on my second run:

```r
library(arrow)
library(dplyr)
library(tictoc)

nyc_taxi <- open_dataset("data/nyc-taxi/")

tic()
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  collect()
toc()
```

With 16.1.0, the same query took only 5 seconds on both runs.

Component(s)

R

@nealrichardson

Could you do show_query() on both versions?
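For reference, `show_query()` prints the Acero ExecPlan without executing the query, so it can be appended in place of `collect()`. A sketch against the pipeline from the original report (the `data/nyc-taxi/` path is the reporter's local dataset):

```r
library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("data/nyc-taxi/")

# Print the ExecPlan instead of evaluating it
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  show_query()
```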

@thisisnic

thisisnic commented Aug 9, 2024

Here's the one from 16.1.0:

```
ExecPlan with 5 nodes:
4:SinkNode{}
  3:ProjectNode{projection=[year, all_trips, shared_trips, "pct_shared": multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 1)]}
    2:GroupByNode{keys=["year"], aggregates=[
    	hash_count_all(*),
    	hash_sum(shared_trips, {skip_nulls=true, min_count=0}),
    ]}
      1:ProjectNode{projection=["shared_trips": (passenger_count > 1), year]}
        0:SourceNode{}
```

Will have to come back to the other one during/after posit::conf if nobody else gets there first as I don't have time now.

@nealrichardson

This is on dev. At least the part I was looking at (the 1:ProjectNode) is the same:

```
ExecPlan with 5 nodes:
4:SinkNode{}
  3:ProjectNode{projection=[year, all_trips, shared_trips, "pct_shared": multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 1)]}
    2:GroupByNode{keys=["year"], aggregates=[
    	hash_count_all(*),
    	hash_sum(shared_trips, {skip_nulls=true, min_count=0}),
    ]}
      1:ProjectNode{projection=["shared_trips": (passenger_count > 1), year]}
        0:SourceNode{}
```

Will have to do some profiling to see what's happening. The obvious suspect is #41576, but I don't have an intuition as to how it's responsible.

@nealrichardson

I think there's something off in the C++ library that's responsible for this regression. I ran profvis on this query on the CRAN 16.1.0 version and a dev build, and the hotspot is RecordBatchReader$read_table called inside collect(), i.e. Acero is slower in evaluating the query and returning the batches. It does not seem to be anything in the R layer.

I'll try bisecting history to find out which commit introduces this.
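For anyone reproducing the profiling step, a minimal profvis sketch (the `data/nyc-taxi/` path is assumed from the original report; the hotspot appears under `RecordBatchReader$read_table` inside `collect()`):

```r
library(arrow)
library(dplyr)
library(profvis)

nyc_taxi <- open_dataset("data/nyc-taxi/")

# Profile the full query evaluation, including collect()
profvis({
  nyc_taxi |>
    group_by(year) |>
    summarise(
      all_trips = n(),
      shared_trips = sum(passenger_count > 1, na.rm = TRUE)
    ) |>
    collect()
})
```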

@nealrichardson

Nope, it is #41576 somehow. That's the commit where this query gets slow. I will try to understand better and get a fix.

@nealrichardson

I think I've figured it out. The query plans look the same, but they're not: the 0:SourceNode doesn't print the options it holds, and that conceals the difference. (I thought I wrote an issue years ago about improving those print methods, but I can't find it now.) The old code path did a dplyr::select() to subset the columns before the aggregation, and that projection got pushed down into the SourceNode. The 1:ProjectNode projection, which selects the columns for the 2:GroupByNode aggregation, does not get pushed down. So without that select(), we read a bunch of data from the Parquet files into memory only to immediately throw it away.

I'll make a PR to add back that select(), and I'll fix the comment from before that was misleading and led me to believe that it wasn't doing anything of value.
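Until that fix lands, a user-side workaround consistent with this diagnosis is to subset columns explicitly before aggregating, so the projection is pushed down into the Parquet scan. A sketch, assuming the dataset from the original report:

```r
library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("data/nyc-taxi/")

nyc_taxi |>
  # Explicit column subset: this projection is pushed into the SourceNode,
  # so only these two columns are read from the Parquet files
  select(year, passenger_count) |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 1) |>
  collect()
```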

nealrichardson added a commit that referenced this issue Aug 14, 2024
### Rationale for this change

See #43627 (comment)

### What changes are included in this PR?

An extra `dplyr::select()`

### Are these changes tested?

Conbench should show that the performance is much better

### Are there any user-facing changes?

Not slow
* GitHub Issue: #43627
nealrichardson added this to the 18.0.0 milestone Aug 14, 2024
jonkeane pushed a commit that referenced this issue Aug 14, 2024