-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43349: [R] Fix altrep string columns from readr #43351
Conversation
|
r/src/arrow_cpp11.h
Outdated
@@ -148,7 +148,7 @@ inline SEXP utf8_strings(SEXP x) { | |||
|
|||
for (R_xlen_t i = 0; i < n; i++, ++p_x) { | |||
SEXP s = *p_x; | |||
if (s != NA_STRING) { | |||
if (s != NA_STRING && ALTREP(s)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not actually work yet. With this
arrow/r/tests/testthat/test-utf.R
Lines 19 to 38 in c3ebdf5
test_that("We handle non-UTF strings", { | |
x <- iconv("Veitingastaðir", to = "latin1") | |
df <- tibble::tibble( | |
chr = x, | |
fct = as.factor(x) | |
) | |
names(df) <- iconv(paste(x, names(df), sep = "_"), to = "latin1") | |
df_struct <- tibble::tibble(a = df) | |
raw_schema <- list(utf8(), dictionary(int8(), utf8())) | |
names(raw_schema) <- names(df) | |
# Confirm setup | |
expect_identical(Encoding(x), "latin1") | |
expect_identical(Encoding(names(df)), c("latin1", "latin1")) | |
expect_identical(Encoding(df[[1]]), "latin1") | |
expect_identical(Encoding(levels(df[[2]])), "latin1") | |
# Array | |
expect_identical(as.vector(Array$create(x)), x) |
Before #43173 this line was:
if (s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s)) {
I suspect what's going on is that we caught vroom altrep vectors with this if and called SET_STRING_ELT
so we didn't get the No Set_elt found for ALTSTRING
error. But which ALTREP(s)
we detect the non-utf strings here too and attempt to SET_STRING_ELT
when we shouldn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts @nealrichardson ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I don't understand what's happening in here. L144 says that this should no longer be altrep by the time we get here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that STRING_PTR_RO()
should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP
pointers.
Also, ALTREP(s)
does not seem to make sense to me, that's a CHARSXP
(not a STRSXP
), so it cannot be ALTREP, unless I am misremembering something.
Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8()
does not do anything (returns the same const char *
) if the string is UTF-8, so you could always call it, and only call SET_STRING_ELT()
if the returned pointer is different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.
Hmm, so maybe the actual problem here is that that isn't working on the ALTREP vectors coming from vroom? Here is the error we started seeing when we changed this condition from s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s)
to s != NA_STRING
:
Error in Table__from_dots(dots, schema, option_use_threads()): No Set_elt found for ALTSTRING class [class: vroom_chr, pkg: vroom]
Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.
Yeah, I agree that ALTREP(s)
here is not right.
Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it,
I will admit I'm not 100% certain what is going on here, but I catching non-UTF-8 strings here is what I thought it was doing.
and only call SET_STRING_ELT() if the returned pointer is different?
Oh interesting. Forgive my C++ naïveness, but would this be something like (i.e. does !=
check if the pointer is different>):
SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(new_s), CE_UTF8));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that C++ looks correct, except you don't need to translate twice:
SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s));
}
However, I think the issue is that if x
is ALTREP, then it does not matter if it is materialized of not, it is still ALTREP, so you still cannot change it with SET_STRING_ELT
.
To avoid this, you'd need to call Rf_duplicate()
on x
. That should create a non-altrep copy of it.
You can always call Rf_duplicate()
, or you can call it when you encounter a non-utf8 string. Note that you need to PROTECT()
the result of Rf_duplicate()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah thanks for those pointers! That looked like it worked, I'm going to trigger a benchmark run to see if there are any unintended consequences there (Hopefully no, but even if so, we might need to accept them!)
r/DESCRIPTION
Outdated
@@ -62,6 +62,7 @@ Suggests: | |||
lubridate, | |||
pillar, | |||
pkgload, | |||
readr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure the test is worth adding to suggests, though we will get rid of that annoying xref error if we do.
@ursabot please benchmark |
Benchmark runs are scheduled for commit 2a0da1e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
r/src/arrow_cpp11.h
Outdated
// ensure that x is not actually altrep first | ||
if (ALTREP(x)) { | ||
x = PROTECT(Rf_duplicate(x)); | ||
UNPROTECT(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this isn't right, doesn't the UNPROTECT
need to go after you're done using the thing you PROTECT
ed?
I'm also not sure how the C-level PROTECT stuff interacts with unwind_protect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I'm surprised I don't get any stack imbalance issues with it. I do get them if I don't have UNPROTECT
or I call it unconditionally. But anyway I have a slightly different version incoming.
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 2a0da1e. There were 38 benchmark results with an error:
There were 11 benchmark results indicating a performance regression:
The full Conbench report has more details. |
@github-actions crossbow submit -g r |
Revision: 206e94d Submitted crossbow builds: ursacomputing/crossbow @ actions-6de7c6d005 |
// ensure that x is not actually altrep first | ||
bool was_altrep = ALTREP(x); | ||
if (was_altrep) { | ||
x = PROTECT(Rf_duplicate(x)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment about why we have to duplicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll expand what I have there
@@ -152,6 +157,9 @@ inline SEXP utf8_strings(SEXP x) { | |||
SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we want to check whether Rf_translateCharUTF8()
actually modified anything? Or do we trust that SET_STRING_ELT
is a no-op in that case? I would imagine that in most cases, we already have ascii/utf-8 strings, so this whole function should be basically free. That should be easily verified by microbenchmarking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this from above?
SEXP new_s = Rf_translateCharUTF8(s);
if (new_s == s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s, CE_UTF8));
}
Yeah, that's probably good. It'll also make this slightly more inline with what was there before (checking that not utf8)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried this, but after getting it to work with a bit of type faffing, doing the translate first we get errors with our utf string tests again.
### Rationale for this change To resolve the reverse dependency issue with `parquetize` ### What changes are included in this PR? One step towards resolving the issue ### Are these changes tested? yes ### Are there any user-facing changes? no * GitHub Issue: #43349 Authored-by: Jonathan Keane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 187197c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
To resolve the reverse dependency issue with
parquetize
What changes are included in this PR?
One step towards resolving the issue
Are these changes tested?
yes
Are there any user-facing changes?
no