readstat not converting encoding of sas7bcat labels #152

Closed
ofajardo opened this issue Oct 11, 2018 · 5 comments

ofajardo commented Oct 11, 2018

Sas7bcat labels with special characters are not correctly translated into UTF-8.

As an example taken from this R-Haven issue, when reading the file "formats.sas7bcat" from here, I get labels like "modalit\xe9 \xe01", which are not valid UTF-8 but are valid windows-1252 or latin1. readstat correctly sets the file.encoding to windows-1252, so the string should already be valid UTF-8 by the time my readstat_value_label_handler function receives it. This happens in pyreadstat, in R-Haven, and when debugging readstat with gdb. A user found a similar issue in pyreadstat with another file of his.
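To make the symptom concrete, here is a minimal sketch of a structural UTF-8 check that can be run on the bytes a label handler receives. The function name `looks_like_utf8` is hypothetical and not part of readstat, and the check is deliberately simplified (it does not reject every overlong or surrogate sequence).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of readstat): returns true if the first
 * len bytes of s are structurally well-formed UTF-8. Simplified: it checks
 * lead and continuation bytes but does not reject every overlong 3/4-byte
 * form or surrogate code point. */
bool looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;                              /* continuation bytes expected */
        if (c < 0x80)                    extra = 0; /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                         /* stray continuation or invalid lead */
        if (extra > len - i - 1)
            return false;                          /* sequence truncated at end of input */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                      /* not a 10xxxxxx continuation byte */
        i += extra + 1;
    }
    return true;
}
```

With the label from this report, the byte 0xE9 announces a three-byte sequence but is followed by a space, so the check fails; the same text encoded as UTF-8 ("modalit\xc3\xa9 \xc3\xa0" "1") passes. Note that in C source the literal has to be split as "modalit\xe9 \xe0" "1", otherwise the trailing 1 is read as part of the hex escape.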

Looking at readstat_sas7bcat_read.c, in the function sas7bcat_parse_value_labels, it seems to me that the variable label never gets converted. I inserted the following after line 91 and it cures the problem:

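        const char *label = &lbp2[10]; // this is line 91
        // added 2018-10-11: convert the label from the file encoding to UTF-8.
        // The buffer holds chars (not char pointers); its size is an assumption
        // that leaves room for multi-byte UTF-8 output plus a terminating NUL.
        char label2[4 * label_len + 1];
        retval = readstat_convert(label2, sizeof(label2),
                    label, label_len, ctx->converter);
        if (retval != READSTAT_OK)
                goto cleanup;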

My understanding of readstat and iconv is still limited (I hope to improve it!), so I am not sure this is the proper solution. That is why I did not dare to send a PR, but I can send one after your suggestions.
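For readers unfamiliar with iconv, the conversion step discussed above can be sketched outside readstat with the plain POSIX iconv API. The helper `to_utf8` below is hypothetical and is not how readstat implements `readstat_convert`; it only illustrates why the destination buffer has to be larger than the source label, since each windows-1252 byte can expand to several UTF-8 bytes.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of readstat): converts src (src_len bytes,
 * encoded as from_enc) into NUL-terminated UTF-8 in dst.
 * Returns 0 on success, -1 on failure (unknown encoding, invalid input,
 * or dst too small). */
int to_utf8(char *dst, size_t dst_len,
            const char *src, size_t src_len, const char *from_enc)
{
    if (dst_len == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;         /* iconv takes char **, the input bytes are not modified */
    char *out = dst;
    size_t in_left = src_len;
    size_t out_left = dst_len - 1;  /* reserve one byte for the terminator */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;                  /* e.g. E2BIG when dst is too small */

    *out = '\0';
    return 0;
}
```

A caller would size the destination generously, for example `char buf[4 * label_len + 1];`, which is a conservative assumption rather than a documented readstat requirement. A buffer of only `label_len + 1` bytes is too small for the 11-byte label in this report, because its UTF-8 form needs 13 bytes.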

Another smaller, but still confusing, thing: if I set the encoding manually with readstat_set_file_character_encoding to, say, LATIN1, and later read the file encoding back with readstat_get_file_encoding, I still get WINDOWS-1252. I think the reason is that in readstat_sas7bcat_read.c line 371:

.file_encoding = hinfo->encoding

should be:

.file_encoding = ctx->input_encoding

as it is in readstat_sas7bdat_read.c line 594, to reflect that the user set the encoding manually.
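The intended behaviour can be sketched as a small precedence rule: report the user-supplied encoding when there is one, otherwise the encoding detected in the file header. The struct and function names below are hypothetical stand-ins, not the real readstat types, and the fallback to the header value when no override is set is an assumption about the desired semantics.

```c
#include <stddef.h>

/* Hypothetical stand-ins for parser state; the real readstat structs
 * have more fields and may differ. */
typedef struct {
    const char *input_encoding;  /* set manually by the user, or NULL */
} sketch_ctx_t;

typedef struct {
    const char *encoding;        /* detected from the file header */
} sketch_header_t;

/* Returns the encoding that should be reported back to the caller:
 * the manual override wins, the header value is the fallback. */
const char *sketch_reported_encoding(const sketch_ctx_t *ctx,
                                     const sketch_header_t *hinfo)
{
    return ctx->input_encoding != NULL ? ctx->input_encoding
                                       : hinfo->encoding;
}
```

With an override of "LATIN1" and a header value of "WINDOWS-1252" this returns "LATIN1", which is the result expected from readstat_get_file_encoding in the scenario described above.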

@evanmiller
Contributor

Hi, thanks for the report. The issue should be fixed now, please update and try with the latest code.

@ofajardo
Author

Thanks!

What about the other proposal? It would be good to have consistency between reading sas7bdat and sas7bcat.

@evanmiller
Contributor

evanmiller commented Oct 11, 2018 via email

@ofajardo
Author

Awesome! Thanks again!

@ofajardo
Author

This correction will probably solve this issue on Haven

evanmiller added the SAS label Apr 13, 2021