rtf_to_text() converts RTF cp1252 russian text bad #50

Open
svladimirs opened this issue Dec 18, 2023 · 12 comments

Comments

@svladimirs

striprtf 0.0.26

rtf_to_text() converts RTF files with Russian text correctly when the header declares cp1251:
{\rtf1\ansi\ansicpg1251
{\rtf1\adeflang1025\ansi\ansicpg1251

But not when the header declares cp1252:
{\rtf1\adeflang1025\ansi\ansicpg1252
абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò

Passing encoding=... does not help.

This helps:
https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943
or
rtf_to_text(rtf.read()).encode('cp1252').decode('ansi')
test-rus.zip
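
The mapping above is consistent with cp1251 bytes being decoded as cp1252. A minimal sketch that reproduces it with Python's built-in codecs only, without striprtf (the variable names are illustrative):

```python
# Text as it should appear, and as rtf_to_text() returned it in this report.
expected = "абвгдеёжзийклмнопрст"
observed = "àáâãäå¸æçèéêëìíîïðñò"

# Encode the Cyrillic text to cp1251 bytes, then misread those bytes as cp1252.
raw = expected.encode("cp1251")
misread = raw.decode("cp1252")
print(misread == observed)

# Reversing the two steps recovers the original text.
recovered = observed.encode("cp1252").decode("cp1251")
print(recovered == expected)
```

This only shows where the garbled characters come from; it says nothing about why the file declares cp1252 in the first place.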

@joshy
Owner

joshy commented Dec 18, 2023

Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251, not cp1252. If I change the RTF content to cp1251, it works fine. cp1252 is the Western European encoding.

@svladimirs
Author

MS Word 2016 saves with 1251 (test-2016.zip), but newer versions released after 2016, such as MS Word 2021, save with 1252 (test-rus).

@stevengj

If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.

@joshy
Owner

joshy commented Dec 20, 2023

I created a small test case myself with Word 365, and it does indeed save the file with encoding 1252. I have no idea how Word finds out the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly, and so is WordPad. The question is: how do they figure out the right encoding?
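
One possible explanation, not confirmed anywhere in this thread: an RTF font table can carry a \fcharsetN word per font, and charset 204 is the Windows Russian charset, which corresponds to cp1251. A viewer could use that instead of \ansicpg. The sketch below only lists the codecs implied by the font table; the helper name, the partial mapping table, and the sample header are all assumptions for illustration, and the sample is not taken from test-rus.zip.

```python
import re

# Windows charset numbers -> Python codec names (small, partial table).
FCHARSET_TO_CODEC = {0: "cp1252", 204: "cp1251", 238: "cp1250"}


def font_codecs(rtf: str) -> set:
    """Hypothetical helper: codecs implied by \\fcharsetN words in the RTF source."""
    numbers = {int(n) for n in re.findall(r"\\fcharset(\d+)", rtf)}
    return {FCHARSET_TO_CODEC[n] for n in numbers if n in FCHARSET_TO_CODEC}


# Made-up header for illustration only.
sample = r"{\rtf1\ansi\ansicpg1252{\fonttbl{\f0\fnil\fcharset204 Calibri;}}"
print(font_codecs(sample))
```

If the affected files really contain \fcharset204, this would explain why Word and WordPad render them correctly, but that would need to be checked against the attached files.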

@svladimirs
Author

Thanks. I did it like this:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass

@joshy
Owner

joshy commented Dec 20, 2023

@svladimirs: Glad you found a workaround. When I run your code I get: LookupError: unknown encoding: ansi. How can this run?
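
A quick way to see which codec names the current interpreter knows is codecs.lookup(), which raises LookupError for unknown names. A minimal sketch (codec_available is a hypothetical helper name); as far as I know, mbcs and its ansi alias exist only on Windows builds of Python, so the last two results depend on the platform:

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if this Python build can resolve the given codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# cp1251 and cp1252 ship with every CPython; mbcs/ansi are platform dependent.
for name in ("cp1251", "cp1252", "mbcs", "ansi"):
    print(name, codec_available(name))
```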

@stevengj

The question is how do they figure out the right encoding?

Maybe they do charset detection?

@joshy
Owner

joshy commented Dec 20, 2023

The question is how do they figure out the right encoding?

Maybe they do charset detection?

I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew.
What I tried:

import chardet

y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
chardet.detect(y)
# {'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}

@svladimirs
Author

@joshy, you're probably using Python < 3.6. See section 7.2.4.1, Text Encodings:
https://docs.python.org/3.5/library/codecs.html
https://docs.python.org/3.6/library/codecs.html
Let's replace ansi with mbcs.

The author of that Stack Overflow answer used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code.
I replaced 'iso-8859-1' with 'cp1252' because the signature is def rtf_to_text(text, encoding="cp1252", errors="strict").

https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170
"For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS".
I thought ansi (mbcs) would be more versatile than cp1251.

What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work?
Maybe the problem with 1251/1252 will go away?

@joshy
Owner

joshy commented Dec 26, 2023

@svladimirs As you can see I am using python 3.9.
image

Regarding your proposals:

  • mbcs is Windows only, so it is not really an option
  • the encoding in def rtf_to_text(text, encoding="mbcs", errors="strict") is only a suggestion. If the file itself declares an encoding, as files from newer versions of Word do, the encoding declared in the RTF file is used
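
To check what a given file declares before deciding on a workaround, the \ansicpg value can be read from the header directly. This is a rough sketch with a regular expression, not how striprtf parses the file internally, and declared_codepage is a hypothetical helper name:

```python
import re
from typing import Optional


def declared_codepage(rtf: str) -> Optional[int]:
    """Return the code page from the first \\ansicpgN control word, if any."""
    match = re.search(r"\\ansicpg(\d+)", rtf)
    return int(match.group(1)) if match else None


# Headers quoted at the top of this issue.
print(declared_codepage(r"{\rtf1\adeflang1025\ansi\ansicpg1251"))
print(declared_codepage(r"{\rtf1\adeflang1025\ansi\ansicpg1252"))
```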

@svladimirs
Author

Well, mbcs won't work either...
Then:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass
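
The same workaround can be wrapped in a small function that catches only the codec errors instead of using a bare except. This is a sketch of the user-side fix, not part of striprtf; recode_cp1252_as_cp1251 is a hypothetical name, and it should only be applied to files known to contain Russian text, because it would also rewrite genuine Western text.

```python
def recode_cp1252_as_cp1251(text: str) -> str:
    """Reinterpret text that was decoded as cp1252 as if it were cp1251.

    Returns the input unchanged when it cannot be round-tripped, for
    example when it already contains Cyrillic characters.
    """
    try:
        return text.encode("cp1252").decode("cp1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(recode_cp1252_as_cp1251("àáâãäå¸æçèéêëìíîïðñò"))
# Caveat: correctly decoded Western text is changed too.
print(recode_cp1252_as_cp1251("café"))
```

Usage would be decoded = recode_cp1252_as_cp1251(rtf_to_text(rtf)), where rtf is the file content as in the snippet above.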

@joshy
Owner

joshy commented Jan 9, 2024

As a library I can't do that, but you as a user can. The reason is that the library has to treat the encoding specified in the RTF file as correct; otherwise it would convert correctly declared files to the wrong encoding.
