Skip to content

Commit

Permalink
GH-42240: [R] Fix crash in ParquetFileWriter$WriteTable and add Write…
Browse files Browse the repository at this point in the history
…Batch (#42241)

### Rationale for this change

See #42240.

### What changes are included in this PR?

- Fixes a crash in `ParquetFileWriter$WriteTable` by asserting the class of what's passed in and stopping if it's not a `Table`
- Since I was already there, added `WriteBatch` to match [`pyarrow.parquet.ParquetWriter.write_batch`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_batch) which is just a convenience
- Adds a test for the behavior of trying to write to a closed sink
- Bumps the minimum Arrow C++ test version we test the R package with on CI from 13 to 15
- Removes one ARROW_VERSION_MAJOR >= 15 guard

### Are these changes tested?

Yes.

### Are there any user-facing changes?

New method on ParquetFileWriter (WriteBatch).
* GitHub Issue: #42240

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
  • Loading branch information
amoeba authored Jul 10, 2024
1 parent 788c8f2 commit 84df343
Show file tree
Hide file tree
Showing 8 changed files with 49 additions and 13 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/r.yml
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ jobs:
strategy:
matrix:
include:
- cpp_version: "13.0.0"
- cpp_version: "15.0.2"
steps:
- name: Checkout Arrow
uses: actions/checkout@3df4ab11eba7bda6032a0b82a6bb43b11571feac # v4.0.0
Expand Down
1 change: 1 addition & 0 deletions r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
* `summarize()` supports more complex expressions, and correctly handles cases where column names are reused in expressions.
* The `na_matches` argument to the `dplyr::*_join()` functions is now supported. This argument controls whether `NA` values are considered equal when joining. (#41358)
* R metadata, stored in the Arrow schema to support round-tripping data between R and Arrow/Parquet, is now serialized and deserialized more strictly. This makes it safer to load data from files from unknown sources into R data.frames. (#41969)
* The minimum version of the Arrow C++ library the Arrow R package can be built with has been bumped to 15.0.0 (#42241)

# arrow 16.1.0

Expand Down
7 changes: 7 additions & 0 deletions r/R/parquet.R
Original file line number Diff line number Diff line change
Expand Up @@ -419,6 +419,7 @@ ParquetWriterProperties$create <- function(column_names,
#' @section Methods:
#'
#' - `WriteTable` Write a [Table] to `sink`
#' - `WriteBatch` Write a [RecordBatch] to `sink`
#' - `Close` Close the writer. Note: does not close the `sink`.
#' [arrow::io::OutputStream][OutputStream] has its own `close()` method.
#'
Expand All @@ -428,8 +429,14 @@ ParquetFileWriter <- R6Class("ParquetFileWriter",
inherit = ArrowObject,
public = list(
WriteTable = function(table, chunk_size) {
assert_is(table, "Table")
parquet___arrow___FileWriter__WriteTable(self, table, chunk_size)
},
WriteBatch = function(batch, ...) {
assert_is(batch, "RecordBatch")
table <- Table$create(batch)
self$WriteTable(table, ...)
},
Close = function() parquet___arrow___FileWriter__Close(self)
)
)
Expand Down
1 change: 1 addition & 0 deletions r/man/ParquetFileWriter.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 0 additions & 9 deletions r/src/r_to_arrow.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1050,7 +1050,6 @@ class RDictionaryConverter<ValueType, enable_if_has_string_view<ValueType>>
template <typename T, typename Enable = void>
struct RConverterTrait;

#if ARROW_VERSION_MAJOR >= 15
template <typename T>
struct RConverterTrait<
T, enable_if_t<!is_nested_type<T>::value && !is_interval_type<T>::value &&
Expand All @@ -1062,14 +1061,6 @@ template <typename T>
struct RConverterTrait<T, enable_if_binary_view_like<T>> {
// not implemented
};
#else
template <typename T>
struct RConverterTrait<
T, enable_if_t<!is_nested_type<T>::value && !is_interval_type<T>::value &&
!is_extension_type<T>::value>> {
using type = RPrimitiveConverter<T>;
};
#endif

template <typename T>
struct RConverterTrait<T, enable_if_list_like<T>> {
Expand Down
28 changes: 28 additions & 0 deletions r/tests/testthat/test-parquet.R
Original file line number Diff line number Diff line change
Expand Up @@ -530,3 +530,31 @@ test_that("thrift string and container size can be specified when reading Parque
data <- reader_container$ReadTable()
expect_identical(collect.ArrowTabular(data), example_data)
})

test_that("We can use WriteBatch on ParquetFileWriter", {
tf <- tempfile()
on.exit(unlink(tf))
sink <- FileOutputStream$create(tf)
sch <- schema(a = int32())
props <- ParquetWriterProperties$create(column_names = names(sch))
writer <- ParquetFileWriter$create(schema = sch, sink = sink, properties = props)

batch <- RecordBatch$create(data.frame(a = 1:10))
writer$WriteBatch(batch, chunk_size = 10)
writer$WriteBatch(batch, chunk_size = 10)
writer$WriteBatch(batch, chunk_size = 10)
writer$Close()

tbl <- read_parquet(tf)
expect_equal(nrow(tbl), 30)
})

test_that("WriteBatch on ParquetFileWriter errors when called on closed sink", {
sink <- FileOutputStream$create(tempfile())
sch <- schema(a = int32())
props <- ParquetWriterProperties$create(column_names = names(sch))
writer <- ParquetFileWriter$create(schema = sch, sink = sink, properties = props)
writer$Close()
batch <- RecordBatch$create(data.frame(a = 1:10))
expect_error(writer$WriteBatch(batch, chunk_size = 10), "Operation on closed file")
})
4 changes: 2 additions & 2 deletions r/tools/check-versions.R
Original file line number Diff line number Diff line change
Expand Up @@ -24,10 +24,10 @@ release_version_supported <- function(r_version, cpp_version) {
r_version <- package_version(r_version)
cpp_version <- package_version(cpp_version)
major <- function(x) as.numeric(x[1, 1])
minimum_cpp_version <- package_version("13.0.0")
minimum_cpp_version <- package_version("15.0.0")

allow_mismatch <- identical(tolower(Sys.getenv("ARROW_R_ALLOW_CPP_VERSION_MISMATCH", "false")), "true")
# If we allow a version mismatch we still need to cover the minimum version (13.0.0 for now)
# If we allow a version mismatch we still need to cover the minimum version (15.0.0 for now)
# we don't allow newer C++ versions as new features without additional feature gates are likely to
# break the R package
version_valid <- cpp_version >= minimum_cpp_version && major(cpp_version) <= major(r_version)
Expand Down
10 changes: 9 additions & 1 deletion r/tools/test-check-versions.R
Original file line number Diff line number Diff line change
Expand Up @@ -61,16 +61,24 @@ test_that("check_versions without mismatch", {
test_that("check_versions with mismatch", {
withr::local_envvar(.new = c(ARROW_R_ALLOW_CPP_VERSION_MISMATCH = "false"))

expect_true(
release_version_supported("15.0.0", "15.0.0")
)

expect_false(
release_version_supported("15.0.0", "13.0.0")
)

withr::local_envvar(.new = c(ARROW_R_ALLOW_CPP_VERSION_MISMATCH = "true"))

expect_true(
expect_false(
release_version_supported("15.0.0", "13.0.0")
)

expect_true(
release_version_supported("16.0.0", "15.0.0")
)

expect_false(
release_version_supported("15.0.0", "16.0.0")
)
Expand Down

0 comments on commit 84df343

Please sign in to comment.