cyrillic turned into chinese #29

Closed
bobert13 opened this issue Jan 5, 2022 · 3 comments


bobert13 commented Jan 5, 2022

Hi

I have two RTF files containing Cyrillic text. Both open without issue in MS Word.
The first seems to work fine with:

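from striprtf.striprtf import rtf_to_text

with open(fullpath) as infile:
    content = infile.read()
    # Pass the error handler by keyword so it cannot be mistaken for another argument
    text = rtf_to_text(content, errors='ignore')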

The second (bad.zip) gets turned into Chinese characters.

good.zip
bad.zip

sample output from the good one:

>>> tabtext =text.split("|||")
>>> print(tabtext[0])
Таблиця розподілу номерного ресурсу
Кіровоградська область|
Код зони - 52

sample output from the bad one:

>>> tabtext =text.split("|")
>>> print(tabtext[0])
亦犭桷 痤顼钿畴 眍戾痦钽 疱耋瘃
它獬怦赅 钺豚耱鼃
暑 珙龛 - 32

If I leave out the "ignore", I get:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 6: illegal multibyte sequence

Any idea how I can work around this?
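
For anyone hitting the same symptom: the garbled sample and the error above look consistent with Cyrillic bytes in Windows-1251 being decoded as GBK. The sketch below reproduces both effects with the standard library only. It is an illustration under that assumption (the issue does not state which code page the bad file uses), and it is not striprtf's internal code.

```python
# Illustrative only: reproduces the symptom with the standard library.
# Assumption: the bad file's text is stored as Windows-1251 (cp1251) bytes.
original = "Таблиця"
raw = original.encode("cp1251")

# Strict decoding as GBK fails on the last byte, 0xFF ("я" in cp1251).
try:
    raw.decode("gbk")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="ignore" silently drops the undecodable byte; the remaining bytes
# pair up into valid GBK sequences, which is where the Chinese text comes from.
garbled = raw.decode("gbk", errors="ignore")

# Decoding the same bytes with the matching code page gives the text back.
recovered = raw.decode("cp1251")

print(strict_error)
print(garbled)
print(recovered)
```

Note that "ignore" does not fix anything here: it only hides the failure, and the dropped bytes mean the garbled text cannot be fully converted back afterwards.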

joshy closed this as completed in b2e88aa Jan 5, 2022

bobert13 commented Jan 6, 2022 via email


joshy commented Jan 6, 2022

Hi,

Yes, the issue is fixed, but until now there was no new release with the fix. You can now upgrade striprtf to version 0.0.19 and it should work.

BR Joshy


bobert13 commented Jan 6, 2022 via email
