Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-42240: [R] Fix crash in ParquetFileWriter$WriteTable and add WriteBatch #42241

Merged
merged 14 commits into from
Jul 10, 2024

Conversation

amoeba
Copy link
Member

@amoeba amoeba commented Jun 21, 2024

Rationale for this change

See #42240.

What changes are included in this PR?

  • Fixes a crash in ParquetFileWriter$WriteTable by asserting the class of what's passed in and stopping if it's not a Table
  • Since I was already there, added WriteBatch to match pyarrow.parquet.ParquetWriter.write_batch which is just a convenience
  • Adds a test for the behavior of trying to write to a closed sink
  • Bumps the minimum Arrow C++ test version we test the R package with on CI from 13 to 15
  • Removes one ARROW_VERSION_MAJOR >= 15 guard

Are these changes tested?

Yes.

Are there any user-facing changes?

New method on ParquetFileWriter (WriteBatch).

Copy link

⚠️ GitHub issue #42240 has been automatically assigned in GitHub to PR creator.

r/R/parquet.R Outdated Show resolved Hide resolved
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for jumping on this! I have one comment, feel free to take it or leave it.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting merge Awaiting merge and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Jun 22, 2024
@amoeba
Copy link
Member Author

amoeba commented Jul 2, 2024

I pushed up a change addressing that comment and noticed I missed documenting the new class method so I added another commit for that. Letting CI run and then I'll merge.

@amoeba
Copy link
Member Author

amoeba commented Jul 3, 2024

One of the tests this PR adds, in conjunction with the minimum supported C++ version check, caught a segfault in Arrow C++ that wasn't fixed until Arrow 15. I think we need to put a closed-check in the R package somewhere to catch this. I'll look into that and include a note that the check can be removed once we bump the minimum version to or above 15.

@amoeba
Copy link
Member Author

amoeba commented Jul 3, 2024

It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see 22a706b. @jonkeane can you take a quick look?

@amoeba amoeba requested a review from jonkeane July 3, 2024 20:49
@jonkeane
Copy link
Member

jonkeane commented Jul 4, 2024

It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see 22a706b. @jonkeane can you take a quick look?

Does this mean we should bump our minimum libarrow to 15 then? If we can't support it, that's the way to go IMO. The backwards compatibility with libarrow is nice, but as far as I know it's not used anywhere specifically. It was added with the hopes that we might use system-library libarrow on CRAN among other places, but that hasn't happened.

@amoeba
Copy link
Member Author

amoeba commented Jul 4, 2024

Hrm, maybe. I think the only thing to do other than that would be to duplicate the open/closed state tracking on the R side until we bump the minimum version for some other reason. I think it may be relevant that this segfault is just in the manual invocation of the ParquetFileWriter and I'm not sure if that impacts the decision-making process here.

@jonkeane
Copy link
Member

jonkeane commented Jul 4, 2024

Do we know of any actual uses of the backwards compatibility in the wild?

@amoeba
Copy link
Member Author

amoeba commented Jul 4, 2024

I'm not aware of any.

@jonkeane
Copy link
Member

jonkeane commented Jul 4, 2024

I'm not aware of any.

In that case I think we should just require >15 and that's ok (that might also let us delete some of the ifdefs too).

@amoeba amoeba force-pushed the feature/r-parquet-write-batch branch from 22a706b to e3c460d Compare July 9, 2024 20:29
If you use 15.0.0 here like you'd expect you could, you get

root@0347b6360c69:/# apt install -y -V libarrow-dev=15.0.0-1 \
                               libarrow-acero-dev=15.0.0-1 \
                               libparquet-dev=15.0.0-1 \
                               libarrow-dataset-dev=15.0.0-1
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libarrow-acero-dev : Depends: libarrow-acero1500 (= 15.0.0-1) but 15.0.2-1 is to be installed
 libarrow-dataset-dev : Depends: libarrow-dataset1500 (= 15.0.0-1) but 15.0.2-1 is to be installed
 libarrow-dev : Depends: libarrow1500 (= 15.0.0-1) but 15.0.2-1 is to be installed
 libparquet-dev : Depends: libparquet1500 (= 15.0.0-1) but 15.0.2-1 is to be installed
E: Unable to correct problems, you have held broken packages.
@amoeba
Copy link
Member Author

amoeba commented Jul 9, 2024

Okay @jonkeane, I've bumped the minimum version from 13 to 15 and updated the PR body. There was only one ARROW_VERSION_MAJOR guard to be removed which I removed.

@jonkeane
Copy link
Member

jonkeane commented Jul 9, 2024

Thanks! Sorry I should have linked this before, but we also need to change

minimum_cpp_version <- package_version("13.0.0")
too yeah?

@amoeba
Copy link
Member Author

amoeba commented Jul 9, 2024

My fault for missing that. I changed that and added a NEWS.md entry while I was at it.

@amoeba
Copy link
Member Author

amoeba commented Jul 10, 2024

I think CI is happy now that I pushed c2d39d0.

@jonkeane jonkeane merged commit 84df343 into apache:main Jul 10, 2024
10 checks passed
@jonkeane jonkeane removed the awaiting merge Awaiting merge label Jul 10, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 84df343.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants