readstat not converting encoding of sas7bcat labels #152

ofajardo · 2018-10-11T16:03:41Z

Sas7bcat labels with special characters are not correctly translated into UTF-8.

As an example coming from this R-Haven issue, when reading the file "formats.sas7bcat" coming from here, I get labels like "modalit\xe9 \xe01", which are not valid UTF-8 but are valid windows-1252 or latin1. The thing is that readstat correctly sets file.encoding to windows-1252, so the string should already be valid UTF-8 by the time my readstat_value_label_handler function receives it. This happens in pyreadstat, in R-Haven, and when debugging readstat with gdb. A user found a similar issue in pyreadstat with another file of his.
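To make the "not valid UTF-8" claim concrete, here is a minimal sketch of a UTF-8 well-formedness check. The function name is_valid_utf8 is hypothetical and is not part of ReadStat. In the label above, the byte 0xE9 looks like the lead byte of a three-byte sequence, but the byte after it is a plain space, so the check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a ReadStat function): returns true if the
 * len bytes at s form well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (c < 0x80) { i++; continue; }                 /* ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return false;                               /* invalid lead byte */

        if (len - i <= n) return false;                  /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return false;       /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;

        i += n + 1;
    }
    return true;
}
```

Note that the label has to be written as "modalit\xe9 \xe0" "1" in C source, because "\xe01" would be parsed as a single, out-of-range hex escape.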

Looking at readstat_sas7bcat_read.c, in the function sas7bcat_parse_value_labels, it seems to me that the variable label never gets converted. I inserted the following after line 91, and it cures the problem:

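        const char *label = &lbp2[10]; // this is line 91
        // added! 20181011
        // char buffer (not an array of pointers); UTF-8 output can need more bytes than the source encoding
        char label2[4 * label_len + 1];
        retval = readstat_convert(label2, sizeof(label2),
                    label, label_len, ctx->converter);
        if (retval != READSTAT_OK)
                goto cleanup;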

As my understanding of readstat and iconv is still limited (I hope to improve it!), I am not sure this is the proper solution, so I did not dare to send a PR, but I can do so after your suggestions.
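For readers unfamiliar with iconv, the conversion step discussed above can be sketched outside ReadStat using only the POSIX iconv API. This is an illustration under stated assumptions, not ReadStat's implementation: the helper name label_to_utf8 is hypothetical, and it opens a converter on every call, whereas the snippet above reuses ctx->converter.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a ReadStat function): converts src_len bytes of
 * src, encoded as from_enc, into NUL-terminated UTF-8 in dst.
 * Returns 0 on success and -1 on any failure (unknown encoding, invalid
 * input bytes, or a destination buffer that is too small). */
static int label_to_utf8(char *dst, size_t dst_len,
                         const char *src, size_t src_len,
                         const char *from_enc) {
    assert(dst != NULL && src != NULL && from_enc != NULL);
    if (dst_len == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;         /* iconv takes a non-const input pointer */
    char *out = dst;
    size_t in_left = src_len;
    size_t out_left = dst_len - 1;  /* reserve one byte for the terminator */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *out = '\0';
    return 0;
}
```

Sizing the destination as 4 * src_len + 1 bytes is a conservative choice: a single-byte source character can expand to several bytes in UTF-8, so a buffer of only src_len bytes may be too small.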

Another smaller but still confusing thing: if I set the encoding manually with readstat_set_file_character_encoding to, say, LATIN1, and later retrieve the file encoding with readstat_get_file_encoding, I still get WINDOWS-1252. I think the reason is that in readstat_sas7bcat_read.c, line 371:

.file_encoding = hinfo->encoding

should be:

.file_encoding = ctx->input_encoding

as it is in readstat_sas7bdat_read.c line 594, to reflect that the user set the encoding manually.
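The difference between the two assignments can be sketched with simplified stand-in types. Everything below is hypothetical and only mirrors the field names quoted above. It assumes that the context's input_encoding holds the user's override when one was set and the header-detected encoding otherwise, which is not confirmed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the header info and reader context. */
typedef struct { const char *encoding; } header_info_t;       /* detected from the file */
typedef struct { const char *input_encoding; } reader_ctx_t;  /* encoding actually used */

/* Assumed resolution rule: a non-empty user override wins over the header. */
static void ctx_init(reader_ctx_t *ctx, const header_info_t *hinfo,
                     const char *user_encoding) {
    assert(ctx != NULL && hinfo != NULL);
    ctx->input_encoding = (user_encoding != NULL && user_encoding[0] != '\0')
                              ? user_encoding
                              : hinfo->encoding;
}

/* Reporting ctx->input_encoding (rather than hinfo->encoding) makes the
 * metadata reflect a manual override. */
static const char *reported_file_encoding(const reader_ctx_t *ctx) {
    return ctx->input_encoding;
}
```

With this rule, a caller who overrides the encoding with LATIN1 gets LATIN1 back from the metadata, while a caller who sets nothing still sees the header value.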


evanmiller · 2018-10-11T17:31:25Z

Hi, thanks for the report. The issue should be fixed now, please update and try with the latest code.

ofajardo · 2018-10-11T18:54:48Z

Thanks!

What about the other proposal? It would be good to have consistency between reading sas7bdat and sas7bcat.

evanmiller · 2018-10-11T19:13:01Z

Okay, I’ve changed the behavior to match SAS7BDAT. Thanks.


ofajardo · 2018-10-11T20:10:09Z

Awesome! Thanks again!

ofajardo · 2018-10-16T11:10:34Z

This correction will probably solve this issue on Haven

ofajardo mentioned this issue Oct 11, 2018

UnicodeDecodeError when reading value labels Roche/pyreadstat#4

Closed

evanmiller closed this as completed in 8666c56 Oct 11, 2018

ofajardo mentioned this issue Oct 11, 2018

read_sas: encoding of .sas7bdat tidyverse/haven#394

Closed

evanmiller added the SAS label Apr 13, 2021
