GH-43349: [R] Fix altrep string columns from readr #43351

jonkeane · 2024-07-21T02:46:13Z

Rationale for this change

To resolve the reverse dependency issue with parquetize

What changes are included in this PR?

One step towards resolving the issue

Are these changes tested?

yes

Are there any user-facing changes?

no

GitHub Issue: [R] String columns read lazily from readr error when transferred to an arrow table #43349

github-actions · 2024-07-21T02:46:38Z

⚠️ GitHub issue #43349 has been automatically assigned in GitHub to PR creator.

jonkeane · 2024-07-21T02:50:16Z

r/src/arrow_cpp11.h

@@ -148,7 +148,7 @@ inline SEXP utf8_strings(SEXP x) {

    for (R_xlen_t i = 0; i < n; i++, ++p_x) {
      SEXP s = *p_x;
-      if (s != NA_STRING) {
+      if (s != NA_STRING && ALTREP(s)) {


This does not actually work yet. With this change, the following test fails:

arrow/r/tests/testthat/test-utf.R

Lines 19 to 38 in c3ebdf5

    test_that("We handle non-UTF strings", {
      x <- iconv("Veitingastaðir", to = "latin1")
      df <- tibble::tibble(
        chr = x,
        fct = as.factor(x)
      )
      names(df) <- iconv(paste(x, names(df), sep = "_"), to = "latin1")
      df_struct <- tibble::tibble(a = df)

      raw_schema <- list(utf8(), dictionary(int8(), utf8()))
      names(raw_schema) <- names(df)

      # Confirm setup
      expect_identical(Encoding(x), "latin1")
      expect_identical(Encoding(names(df)), c("latin1", "latin1"))
      expect_identical(Encoding(df[[1]]), "latin1")
      expect_identical(Encoding(levels(df[[2]])), "latin1")

      # Array
      expect_identical(as.vector(Array$create(x)), x)

Before #43173 this line was:

if (s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s)) {

I suspect what's going on is that we caught vroom altrep vectors with this if and called SET_STRING_ELT so we didn't get the No Set_elt found for ALTSTRING error. But with ALTREP(s) we detect the non-UTF strings here too and attempt to SET_STRING_ELT when we shouldn't.

Any thoughts @nealrichardson ?

I guess I don't understand what's happening in here. L144 says that this should no longer be altrep by the time we get here.

I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.

Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.

Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it, and only call SET_STRING_ELT() if the returned pointer is different?

> I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.

Hmm, so maybe the actual problem here is that that isn't working on the ALTREP vectors coming from vroom? Here is the error we started seeing when we changed this condition from s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s) to s != NA_STRING:

Error in Table__from_dots(dots, schema, option_use_threads()): No Set_elt found for ALTSTRING class [class: vroom_chr, pkg: vroom]

> Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.

Yeah, I agree that ALTREP(s) here is not right.

> Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it,

I will admit I'm not 100% certain what is going on here, but catching non-UTF-8 strings here is what I thought it was doing.

> and only call SET_STRING_ELT() if the returned pointer is different?

Oh interesting. Forgive my C++ naïveté, but would this be something like the following (i.e. does != check whether the pointer is different?):

SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);
if (new_s != s) {
  SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(new_s), CE_UTF8));
}

Yeah, that C++ looks correct, except you don't need to translate twice:

SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);
if (new_s != s) {
  SET_STRING_ELT(x, i, new_s);
}

However, I think the issue is that if x is ALTREP, then it does not matter whether it is materialized or not: it is still ALTREP, so you still cannot change it with SET_STRING_ELT.

To avoid this, you'd need to call Rf_duplicate() on x. That should create a non-altrep copy of it.

You can always call Rf_duplicate(), or you can call it when you encounter a non-utf8 string. Note that you need to PROTECT() the result of Rf_duplicate()
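The idea suggested in this thread (translate, compare pointers, and copy only when something actually changes) can be sketched outside of R roughly as follows. This is a minimal standalone C++ sketch under stated assumptions, not the code in this PR: `translate_latin1_to_utf8()` is a hypothetical stand-in for Rf_translateCharUTF8() that assumes any non-ASCII input is Latin-1, a `const std::vector<std::string>` stands in for an ALTREP vector that cannot be modified in place, and `Utf8Result` and `utf8_copy_if_needed()` are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Rf_translateCharUTF8(): returns the input pointer
// unchanged when the bytes are pure ASCII; otherwise writes a UTF-8 copy into
// `scratch` (ASSUMING the input is Latin-1) and returns a pointer to that copy.
const char* translate_latin1_to_utf8(const char* in, std::string& scratch) {
  bool ascii = true;
  for (const char* p = in; *p != '\0'; ++p) {
    if (static_cast<unsigned char>(*p) >= 0x80) {
      ascii = false;
      break;
    }
  }
  if (ascii) return in;  // nothing to translate: same pointer comes back

  scratch.clear();
  for (const char* p = in; *p != '\0'; ++p) {
    const unsigned char c = static_cast<unsigned char>(*p);
    if (c < 0x80) {
      scratch.push_back(static_cast<char>(c));
    } else {
      // Latin-1 bytes 0x80..0xFF map to two-byte UTF-8 sequences.
      scratch.push_back(static_cast<char>(0xC0 | (c >> 6)));
      scratch.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return scratch.c_str();
}

// Illustrative result type: `values` is filled only when `copied` is true.
struct Utf8Result {
  bool copied = false;
  std::vector<std::string> values;
};

// Never writes to `x` (the stand-in for an unmodifiable ALTREP vector).
// A copy is made lazily, on the first element whose translation differs.
Utf8Result utf8_copy_if_needed(const std::vector<std::string>& x) {
  Utf8Result out;
  std::string scratch;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const char* original = x[i].c_str();
    const char* translated = translate_latin1_to_utf8(original, scratch);
    if (translated != original) {  // pointer comparison, not string comparison
      if (!out.copied) {
        out.values = x;  // analogue of duplicating x before modifying it
        out.copied = true;
      }
      out.values[i] = translated;
    }
  }
  return out;
}
```

For an all-ASCII input, `copied` stays false and the caller keeps using the original vector, which is the "basically free" path; only input containing a non-ASCII element pays for the copy.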

Aaah thanks for those pointers! That looked like it worked, I'm going to trigger a benchmark run to see if there are any unintended consequences there (Hopefully no, but even if so, we might need to accept them!)

jonkeane · 2024-07-21T02:50:55Z

r/DESCRIPTION

@@ -62,6 +62,7 @@ Suggests:
    lubridate,
    pillar,
    pkgload,
+    readr,


Not sure the test is worth adding readr to Suggests, though we will get rid of that annoying xref error if we do.

jonkeane · 2024-07-23T12:07:59Z

@ursabot please benchmark

ursabot · 2024-07-23T12:08:04Z

Benchmark runs are scheduled for commit 2a0da1e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

nealrichardson · 2024-07-23T13:21:21Z

r/src/arrow_cpp11.h

+  // ensure that x is not actually altrep first
+  if (ALTREP(x)) {
+    x = PROTECT(Rf_duplicate(x));
+    UNPROTECT(1);


I'm pretty sure this isn't right, doesn't the UNPROTECT need to go after you're done using the thing you PROTECTed?

I'm also not sure how the C-level PROTECT stuff interacts with unwind_protect.

Hmmm, I'm surprised I don't get any stack imbalance issues with it. I do get them if I don't have UNPROTECT or I call it unconditionally. But anyway I have a slightly different version incoming.
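The balance question raised here (protect only on one path, release only after the work is done, and release exactly as many entries as were protected) can be illustrated with a toy model. This is a sketch under loud assumptions: `ProtectStack` is a made-up stand-in for R's pointer protection stack and not its real implementation, `convert_with_balanced_protect()` is a hypothetical worker, and nothing here models garbage collection or cpp11's unwind_protect.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for R's pointer protection stack. This is NOT R's
// implementation; it only tracks which pointers are currently protected.
class ProtectStack {
 public:
  // Analogue of PROTECT(): push the pointer and hand it back.
  const void* protect(const void* p) {
    entries_.push_back(p);
    return p;
  }
  // Analogue of UNPROTECT(n): pop the n most recent entries.
  void unprotect(std::size_t n) {
    assert(n <= entries_.size());  // popping too much is a stack imbalance
    entries_.resize(entries_.size() - n);
  }
  std::size_t depth() const { return entries_.size(); }

 private:
  std::vector<const void*> entries_;
};

// Hypothetical worker: protects a copy only on the "was ALTREP" path, uses
// it, and then releases exactly what it protected. Returns the stack depth
// observed while the work was in progress.
std::size_t convert_with_balanced_protect(ProtectStack& stack, bool was_altrep) {
  std::vector<int> copy;  // stands in for the result of Rf_duplicate()
  std::size_t n_protected = 0;
  if (was_altrep) {
    stack.protect(&copy);
    ++n_protected;
  }
  // The copy is still protected at this point, while it is being used.
  const std::size_t depth_during_work = stack.depth();
  // Release only after the work is done, and the same amount on both paths.
  stack.unprotect(n_protected);
  return depth_during_work;
}
```

Counting what was protected, rather than calling the release unconditionally, keeps the stack depth unchanged on both the ALTREP and the non-ALTREP path.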

conbench-apache-arrow · 2024-07-23T18:06:52Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 2a0da1e.

There were 38 benchmark results with an error:

Pull Request Run on ec2-m5-4xlarge-us-east-2 at 2024-07-23 16:05:59Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-20, scale_factor=10
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-20, scale_factor=1
and 36 more (see the report linked below)

There were 11 benchmark results indicating a performance regression:

Pull Request Run on ec2-c6a-4xlarge-us-east-2 at 2024-07-23 13:12:31Z
- BenchmarkTemporalRounding (C++) with params=<CeilTemporal, non_zoned, round_10_week>/524288/0, source=cpp-micro, suite=arrow-compute-scalar-temporal-benchmark
- BenchmarkTemporalRounding (C++) with params=<FloorTemporal, non_zoned, round_1_week>/524288/100, source=cpp-micro, suite=arrow-compute-scalar-temporal-benchmark
and 9 more (see the report linked below)

The full Conbench report has more details.

jonkeane · 2024-07-24T01:56:55Z

@github-actions crossbow submit -g r

github-actions · 2024-07-24T01:59:28Z

Revision: 206e94d

Submitted crossbow builds: ursacomputing/crossbow @ actions-6de7c6d005

Task	Status
r-binary-packages
test-r-arrow-backwards-compatibility
test-r-clang-sanitizer
test-r-depsource-bundled
test-r-depsource-system
test-r-dev-duckdb
test-r-devdocs
test-r-gcc-11
test-r-gcc-12
test-r-install-local
test-r-install-local-minsizerel
test-r-linux-as-cran
test-r-linux-rchk
test-r-linux-valgrind
test-r-minimal-build
test-r-offline-maximal
test-r-offline-minimal
test-r-rhub-debian-gcc-devel-lto-latest
test-r-rhub-debian-gcc-release-custom-ccache
test-r-rhub-ubuntu-release-latest
test-r-rocker-r-ver-latest
test-r-rstudio-r-base-4.1-opensuse155
test-r-rstudio-r-base-4.2-focal
test-r-ubuntu-22.04
test-r-versions
test-ubuntu-r-sanitizer

nealrichardson · 2024-07-24T12:37:45Z

r/src/arrow_cpp11.h

+    // ensure that x is not actually altrep first
+    bool was_altrep = ALTREP(x);
+    if (was_altrep) {
+      x = PROTECT(Rf_duplicate(x));


Add a comment about why we have to duplicate?

Yeah, I'll expand what I have there

nealrichardson · 2024-07-24T12:42:10Z

r/src/arrow_cpp11.h

@@ -152,6 +157,9 @@ inline SEXP utf8_strings(SEXP x) {
        SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8));


Did we want to check whether Rf_translateCharUTF8() actually modified anything? Or do we trust that SET_STRING_ELT is a no-op in that case? I would imagine that in most cases, we already have ascii/utf-8 strings, so this whole function should be basically free. That should be easily verified by microbenchmarking.

Something like this from above?

const char* translated = Rf_translateCharUTF8(s);
if (translated != CHAR(s)) {
  SET_STRING_ELT(x, i, Rf_mkCharCE(translated, CE_UTF8));
}

Yeah, that's probably good. It'll also make this slightly more in line with what was there before (checking that the string is not UTF-8).

I've tried this, but after getting it to compile with a bit of type faffing, doing the translate first gives us errors in our UTF string tests again.

### Rationale for this change

To resolve the reverse dependency issue with `parquetize`

### What changes are included in this PR?

One step towards resolving the issue

### Are these changes tested?

yes

### Are there any user-facing changes?

no

* GitHub Issue: #43349

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>

conbench-apache-arrow · 2024-07-28T01:22:13Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 187197c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them.

check if altrep

2a7eee6

jonkeane requested a review from thisisnic as a code owner July 21, 2024 02:46

ensure altrep strings aren't altrep first

2a0da1e

a slightly different setup

206e94d

more comment

a22cb2e

An even simpler reprex

2a57460

ugh lintr

990e2cf

jonkeane merged commit 187197c into apache:main Jul 27, 2024
10 checks passed

jonkeane mentioned this pull request Jul 27, 2024

[R] String columns read lazily from readr error when transferred to an arrow table #43349

Closed

jonkeane mentioned this pull request Jul 27, 2024

[R] CRAN packaging checklist for version 17.0.0 #43317

Closed

38 tasks
