[DETECTION] no encoding found, contrarily to chardet and cchardet #104

Closed
adbar opened this issue Sep 17, 2021 · 2 comments
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed

Comments

Contributor

adbar commented Sep 17, 2021

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.

Verbose output

2021-09-17 13:08:23,491 | INFO | Detected declarative mark in sequence. Priority +1 given for utf_8.
2021-09-17 13:08:23,491 | WARNING | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xe4 in position 2531: invalid continuation byte
2021-09-17 13:08:23,492 | WARNING | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xe4 in position 2531: ordinal not in range(128)
2021-09-17 13:08:23,493 | WARNING | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,494 | WARNING | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,496 | WARNING | cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 526.200000 %.
2021-09-17 13:08:23,496 | WARNING | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,508 | WARNING | cp1125 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,509 | WARNING | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,509 | WARNING | cp1250 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,510 | WARNING | cp1251 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,511 | WARNING | cp1252 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,512 | WARNING | cp1253 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,513 | WARNING | cp1254 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,513 | WARNING | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,514 | WARNING | cp1256 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,515 | WARNING | cp1257 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,515 | WARNING | cp1258 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,516 | WARNING | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,517 | WARNING | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x70 in position 31: character maps to <undefined>
2021-09-17 13:08:23,517 | WARNING | cp437 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,518 | WARNING | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,519 | WARNING | cp775 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,520 | WARNING | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,521 | WARNING | cp852 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,522 | WARNING | cp855 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,523 | WARNING | cp857 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,524 | WARNING | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,525 | WARNING | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,525 | WARNING | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,526 | WARNING | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,526 | WARNING | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,527 | WARNING | cp864 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,528 | WARNING | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,528 | WARNING | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,529 | WARNING | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x84 in position 187587: character maps to <undefined>
2021-09-17 13:08:23,530 | WARNING | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,530 | WARNING | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,532 | WARNING | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,532 | WARNING | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,533 | WARNING | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,533 | WARNING | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,534 | WARNING | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,535 | WARNING | hp_roman8 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,535 | WARNING | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,538 | WARNING | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,538 | WARNING | iso8859_10 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,539 | WARNING | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,539 | WARNING | iso8859_13 is deemed too similar to code page cp1257 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,540 | WARNING | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,541 | WARNING | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,541 | WARNING | iso8859_16 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,542 | WARNING | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,543 | WARNING | iso8859_3 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,543 | WARNING | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,544 | WARNING | iso8859_5 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,544 | WARNING | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,545 | WARNING | iso8859_7 is deemed too similar to code page cp1253 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,545 | WARNING | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd6 in position 22977: character maps to <undefined>
2021-09-17 13:08:23,546 | WARNING | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,546 | WARNING | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xd6 in position 22977: illegal multibyte sequence
2021-09-17 13:08:23,547 | WARNING | koi8_r was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,547 | WARNING | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,547 | WARNING | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,548 | WARNING | mac_cyrillic was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,549 | WARNING | mac_greek was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,549 | WARNING | mac_iceland was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,550 | WARNING | mac_latin2 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,551 | WARNING | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-09-17 13:08:23,551 | WARNING | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-09-17 13:08:23,552 | WARNING | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,552 | WARNING | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,554 | WARNING | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,555 | WARNING | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,556 | WARNING | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,556 | INFO | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-09-17 13:08:23,556 | WARNING | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 161318-161319: illegal encoding
2021-09-17 13:08:23,556 | WARNING | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 161560-161561: illegal encoding
2021-09-17 13:08:23,556 | INFO | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-09-17 13:08:23,557 | WARNING | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-09-17 13:08:23,557 | WARNING | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-09-17 13:08:23,557 | WARNING | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xe4 in position 2531: unexpected special character
Unable to identify originating encoding for "anzeige-value-stars-mit-ausgewaehlten-aktien-den-dax-schlagen-5873873". Maybe try increasing maximum amount of chaos.
{
    "path": "/home/adbar/anzeige-value-stars-mit-ausgewaehlten-aktien-den-dax-schlagen-5873873",
    "encoding": null,
    "encoding_aliases": [],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [],
    "has_sig_or_bom": false,
    "chaos": 1.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
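The "does not fit given bytes sequence at ALL" lines above report strict decode failures for a candidate code page. The following is a minimal sketch of that kind of elimination step using only Python's built-in codecs. It is not charset_normalizer's implementation; the helper name `strict_decode_filter` and the sample bytes are made up for illustration.

```python
# Hypothetical helper: split candidate codecs by whether a strict decode succeeds.
def strict_decode_filter(payload: bytes, candidates):
    fitting, rejected = [], {}
    for name in candidates:
        try:
            payload.decode(name)  # errors="strict" is the default
        except UnicodeDecodeError as exc:
            # Record the offending byte and position, similar to the log messages.
            rejected[name] = (
                f"byte 0x{payload[exc.start]:02x} at position {exc.start}: {exc.reason}"
            )
        else:
            fitting.append(name)
    return fitting, rejected


# Made-up sample: German text stored as single-byte Windows text (0xe4 is "a umlaut").
sample = b"Ausgew\xe4hlte Aktien"
fitting, rejected = strict_decode_filter(sample, ["utf_8", "ascii", "cp1250", "cp1252"])

print(fitting)
for name, why in rejected.items():
    print(name, "->", why)
```

A strict decode only rules candidates out. Several single-byte code pages usually survive it, which is why a detector needs a further scoring step (the "chaos" probing in the log) to choose among them.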

Expected encoding

chardet and cchardet both agree on windows-1252 but I'm not certain.

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.6.9
  • Package version 2.0.5

Additional context

Your package looks nice! I'm currently testing it with edge cases, i.e. HTML documents with strange or inconsistent encodings.
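The first two log lines above show one such inconsistency: a declarative mark announces utf_8, yet the bytes are not valid UTF-8. Here is a rough sketch of how that mismatch can be checked with the standard library alone. The regular expression, the function names and the sample page are assumptions made for illustration and are not taken from charset_normalizer.

```python
import re

# Hypothetical pattern: finds charset=... near the start of an HTML document.
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)""", re.IGNORECASE)


def declared_charset(html: bytes, window: int = 4096):
    """Return the charset name declared in the first `window` bytes, if any."""
    match = META_CHARSET.search(html[:window])
    return match.group(1).decode("ascii").lower() if match else None


def declaration_holds(html: bytes) -> bool:
    """True only if a charset is declared and the whole payload decodes with it."""
    name = declared_charset(html)
    if name is None:
        return False
    try:
        html.decode(name)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Made-up page: declares UTF-8 but contains a lone 0xe4 byte.
page = b'<html><head><meta charset="utf-8"></head><body>Ausgew\xe4hlte Aktien</body></html>'
print(declared_charset(page))   # utf-8
print(declaration_holds(page))  # False
```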

The issue is also referenced here: adbar/trafilatura#79

adbar added the detection and help wanted labels Sep 17, 2021
Member

Ousret commented Sep 17, 2021

Hi

I could reproduce your result. Thanks for the detailed report, it helped.
I also noticed that the target website serves different content depending on the HTTP client used.

I identified what the engine detected as "mess" and corrected it. The fix is in PR #106.

The result you got from chardet and cchardet is technically correct and decodes the content correctly, but the decoded output is identical if you use windows-1250 instead.
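To illustrate why both answers can be right, the sketch below compares the two code pages byte by byte with Python's built-in codecs. The sample is a made-up German string that reuses byte values reported in the log (0xe4, 0xfc, 0xd6, 0xa0). It is not the actual page content, so it only shows the principle.

```python
def decode_or_none(byte_value: int, codec: str):
    """Decode a single byte, returning None where the code page leaves it undefined."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Byte values on which cp1250 and cp1252 disagree.
differing = {
    b for b in range(256)
    if decode_or_none(b, "cp1250") != decode_or_none(b, "cp1252")
}

# Made-up sample built from byte values seen in the log.
sample = b"Ausgew\xe4hlte Aktien f\xfcr \xd6sterreich:\xa0DAX"
print(sample.decode("cp1250") == sample.decode("cp1252"))  # True
print(any(b in differing for b in sample))                 # False

# One byte where the code pages part ways: 0xe8 is U+010D in cp1250, U+00E8 in cp1252.
print(ascii(b"\xe8".decode("cp1250")), ascii(b"\xe8".decode("cp1252")))
```

As long as a document uses none of the bytes in `differing`, no detector can separate the two code pages from the bytes alone, and either label decodes the text correctly.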

Here is what you should get when using the dev-master version.

{
    "path": "/home/ahmed/PycharmProjects/charset_normalizer/ppp.txt",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [
        "cp1252",
        "cp1254",
        "cp1257",
        "cp1258"
    ],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 5.383,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

Ousret added a commit to Ousret/char-dataset that referenced this issue Sep 17, 2021
Contributor Author

adbar commented Sep 20, 2021

Hi, it works, thanks!

@adbar adbar closed this as completed Sep 20, 2021