rtf_to_text() converts RTF cp1252 russian text bad #50

svladimirs · 2023-12-18T07:25:23Z

striprtf 0.0.26

{\rtf1\ansi\ansicpg1251
{\rtf1\adeflang1025\ansi\ansicpg1251
rtf_to_text() converts cp1251 RTFs correctly (Russian text).

{\rtf1\adeflang1025\ansi\ansicpg1252
But not cp1252:
абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò
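
The garbled output above is consistent with cp1251 bytes being decoded as cp1252. A minimal sketch that reproduces it with the standard Python codecs (the variable names are illustrative only and do not come from striprtf):

```python
# Russian sample text from the report.
original = "абвгдеёжзийклмнопрст"

# Encode with the Cyrillic code page, then decode with the Western one.
# This mimics reading cp1251 bytes under a header that declares \ansicpg1252.
garbled = original.encode("cp1251").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the original text.
restored = garbled.encode("cp1252").decode("cp1251")
print(restored == original)
```

This round trip only works because every byte involved is defined in both code pages; that is not guaranteed for arbitrary text.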

Passing encoding=... does not help.

This helps:
https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943
or
rtf_to_text(rtf.read()).encode('cp1252').decode('ansi')
test-rus.zip

joshy · 2023-12-18T16:08:08Z

Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251 and not in cp1252. If I change the RTF content to cp1251 it works fine. cp1252 is the Western encoding.

svladimirs · 2023-12-19T01:47:24Z

MS Word 2016 (test-2016.zip) saves with 1251, but versions released after 2016, up to and including MS Word 2021, save with 1252 (test-rus).

stevengj · 2023-12-19T16:32:59Z

If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.

joshy · 2023-12-20T08:36:51Z

I created a small test case myself with Word 365, and indeed it saves the file with encoding 1252. I have no idea how Word finds out which encoding is the right one in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly, and WordPad shows it correctly too. The question is how do they figure out the right encoding?
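
The thread does not establish how those viewers do it. One possibility, offered here as an assumption rather than a finding, is that they honour the per-font \fcharsetN entries in the RTF font table; in the RTF specification, charset 204 denotes Russian. The sketch below only illustrates that idea. The helper name, the mapping subset, and the sample header are hypothetical and are not taken from test-rus.zip or from striprtf.

```python
import re

# Small subset of RTF font charset numbers mapped to Python codec names.
FCHARSET_TO_CODEC = {0: "cp1252", 204: "cp1251", 238: "cp1250"}

def guess_codecs_from_fonts(rtf_source):
    """Return codecs implied by fcharset entries, in order of first appearance."""
    codecs_found = []
    for match in re.finditer(r"\\fcharset(\d+)", rtf_source):
        codec = FCHARSET_TO_CODEC.get(int(match.group(1)))
        if codec and codec not in codecs_found:
            codecs_found.append(codec)
    return codecs_found

# Hypothetical header: Western \ansicpg, but a font declared with the Russian charset.
sample = r"{\rtf1\ansi\ansicpg1252{\fonttbl{\f0\fnil\fcharset204 Arial;}}\f0 text}"
print(guess_codecs_from_fonts(sample))
```

A real reader would have to track which font is active for each run of text; this sketch only lists the candidates.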

svladimirs · 2023-12-20T09:16:35Z

Thanks. I did like this:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass

joshy · 2023-12-20T10:31:48Z

@svladimirs: Glad you found a workaround. When I run your code I get: LookupError: unknown encoding: ansi. How can this run?

stevengj · 2023-12-20T13:34:41Z

> The question is how do they figure out the right encoding?

Maybe they do charset detection?

joshy · 2023-12-20T14:51:32Z

> > The question is how do they figure out the right encoding?
>
> Maybe they do charset detection?

I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew.
What I tried:

>>> import chardet
>>> y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
>>> chardet.detect(y)
{'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}

svladimirs · 2023-12-21T02:25:10Z

@joshy, you're probably using Python < 3.6. See section 7.2.4.1, Text Encodings:
https://docs.python.org/3.5/library/codecs.html
https://docs.python.org/3.6/library/codecs.html
Let's replace ansi with mbcs.

The author of that Stack Overflow answer used .encode('iso-8859-1').decode('cp1251'), but I tried to write more universal code.
I replaced 'iso-8859-1' with 'cp1252' because the signature is def rtf_to_text(text, encoding="cp1252", errors="strict").

https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170
"For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS".
I thought ansi (mbcs) would be more versatile than cp1251.

What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work?
Maybe the problem with 1251/1252 will go away?

joshy · 2023-12-26T21:25:02Z

@svladimirs As you can see I am using python 3.9.

Regarding your proposals:

- mbcs is Windows only, so it is not really an option.
- In def rtf_to_text(text, encoding="mbcs", errors="strict"), the encoding argument is only a suggestion. If the file itself declares an encoding, as files from newer versions of Word do, the encoding declared in the RTF file is used.
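
Because codec availability differs between platforms and Python versions, it can be checked before use. A minimal sketch with the standard codecs module; codec_available is a hypothetical helper name, not part of striprtf:

```python
import codecs

def codec_available(name):
    """Return True if this Python installation recognises the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # Same exception type as "LookupError: unknown encoding: ansi" above.
        return False

# Results for "mbcs" and "ansi" depend on the platform and Python version.
for name in ("cp1251", "cp1252", "mbcs", "ansi"):
    print(name, codec_available(name))
```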

svladimirs · 2023-12-29T03:13:35Z

Well, mbcs won't work either...
Then:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass
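
The same workaround can be wrapped in a small helper that catches only encoding errors instead of using a bare except. This is a sketch under the assumption that the input is text that rtf_to_text() already decoded as cp1252; redecode_cp1252_as_cp1251 is a hypothetical name, not a striprtf function:

```python
def redecode_cp1252_as_cp1251(text):
    """Reinterpret cp1252-decoded text as cp1251; return it unchanged on failure."""
    try:
        return text.encode("cp1252").decode("cp1251")
    except UnicodeError:
        # Raised when a character has no cp1252 byte,
        # or when a resulting byte is undefined in cp1251.
        return text

# Mojibake from the report is repaired.
print(redecode_cp1252_as_cp1251("àáâãäå¸æçèéêëìíîïðñò"))

# Text that is already Cyrillic cannot be encoded as cp1252 and is left alone.
print(redecode_cp1252_as_cp1251("абв"))
```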

joshy · 2024-01-09T08:17:07Z

As a library I can't do that; you as a user can. The reason is that when the encoding specified in the RTF file is correct, the library would convert the text to the wrong encoding.
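
A short illustration of that risk, using only the standard codecs: applying the cp1252 to cp1251 round trip to genuinely Western text silently corrupts it. The sample word is an arbitrary choice for the demonstration.

```python
# Text that really is Western European and was decoded correctly as cp1252.
western = "café"

# The unconditional workaround turns the accented letter into a Cyrillic one.
converted = western.encode("cp1252").decode("cp1251")
print(converted)
```

No exception is raised here, so a try/except around the conversion would not catch this case.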
