Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle invalid multi-byte sequences in iconv encoding conversions #264

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

gorcha
Copy link
Contributor

@gorcha gorcha commented Feb 6, 2022

Hi @evanmiller,

This PR is another go at #252 (cc tidyverse/haven#615).

As suggested I've reworked the bad byte handler so it's called from surrounding code when a conversion fails instead of processing bytes inside readstat_convert(). I've implemented the handler in the string conversion code in sav_process_row() as an example, and there are a handful of bad byte handlers in readstat_convert.c.

The handler signature includes the src and dst variables from readstat_convert() as well as the observation index and variable object for error logging.

Thanks!

@gorcha
Copy link
Contributor Author

gorcha commented Mar 21, 2022

Hi @evanmiller! Just wondering if you've had a chance to look at this?

Would you prefer that I finalise the full PR before reviewing? As noted I've only implemented the handler in the code for processing sav string values but I can chuck it in to the others as well and non-draft the PR if that's easier.

Thanks!

@evanmiller
Copy link
Contributor

Hi, I like the overall approach. Maybe rename to invalid_string instead of bad_byte? My thinking is that the handlers are processing the entire strings rather than individual bytes.

@gorcha
Copy link
Contributor Author

gorcha commented Mar 21, 2022

Thanks! Good call, shall do 🙂

I'll make that change then add the handler in to the other reader functions and update the PR.

@gorcha gorcha marked this pull request as ready for review March 22, 2022 11:34
@gorcha
Copy link
Contributor Author

gorcha commented Mar 22, 2022

I've renamed the handler to invalid_string and added the handler into the row processing code for each of the readers, so should be good to go (i.e. ready for checking 😛)!

@evanmiller
Copy link
Contributor

Ok, looking good. Since this introduces a new API I will save it for the next minor version bump (ReadStat 1.2). Can you add some documentation (at least to the header file) about how C clients can use the new machinery? I don't think it's really obvious.

I'm also not sure that USER_ABORT is the right return code here since most of the handlers are system-provided rather than user-provided. Maybe a failed conversion should return BAD_STRING or perhaps a new error code.

@gorcha
Copy link
Contributor Author

gorcha commented Mar 22, 2022

No worries, I'll add a comment in the header. Given the weirdness it probably warrants a write-up with an example in the "Reading files" section of the README for some context. Do you think it needs a full example or is that unnecessary bloat?

Fair call re: the error message. I think READSTAT_ERROR_CONVERT_BAD_STRING is still appropriate given that it's still essentially an invalid byte sequence error, but I can see value in signalling the error differently when it has failed after going through the handler. What about a READSTAT_ERROR_INVALID_STRING_HANDLERwith an error message like "Unable to convert string to the requested encoding (could not recover from invalid byte sequence in the invalid string handler)"?

@gorcha
Copy link
Contributor Author

gorcha commented May 16, 2022

Hi @evanmiller, just wondering if you had any feedback on the last comment?

This is the way I'm leaning:

  • As well as the header comment, also add a new subsection to the README with a brief example
  • Just return READSTAT_ERROR_CONVERT_BAD_STRING instead of USER_ABORT if the handler returns READSTAT_HANDLER_ABORT instead of adding a new error type.

@deschen1
Copy link

Sorry for nudging...just curious about this bug and its fix, since it's been quite for some time now. I got hit again by the "invalid byte sequence" bug for another of my .sav files where currently the only solution is to switch back to R package haven with version <= 2.3.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants