[BUG] utf-8 misdetected as cp1256 #174
The file you provided is detected as:
{
  "path": "/home/ahmed/file.txt",
  "encoding": "utf_8",
  "encoding_aliases": [
    "u8",
    "utf",
    "utf8",
    "utf8_ucs2",
    "utf8_ucs4",
    "cp65001"
  ],
  "alternative_encodings": [],
  "language": "Unknown",
  "alphabets": [
    "Arabic",
    "Basic Latin",
    "CJK Unified Ideographs",
    "Control character",
    "Cyrillic",
    "Hebrew",
    "Latin Extended-A",
    "Latin-1 Supplement",
    "Mathematical Operators"
  ],
  "has_sig_or_bom": false,
  "chaos": 0.0,
  "coherence": 0.0,
  "unicode_path": null,
  "is_preferred": true
}

By reading the JSON output you've given, I suspect that the language detector finds a near-perfect match for cp1256. I cannot do anything without the original file; feel free to pass it through mail directly.
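For reference, here is a minimal sketch of how a report like the JSON above can be reproduced with the library's Python API (the path comes from the output above; the exact attribute set may vary between versions):

```python
from charset_normalizer import from_path

# Run detection on the file and inspect the best candidate.
results = from_path("/home/ahmed/file.txt")
best = results.best()

if best is not None:
    print(best.encoding)          # e.g. "utf_8"
    print(best.encoding_aliases)  # e.g. ["u8", "utf", "utf8", ...]
    print(best.language)          # "Unknown" in the report above
    print(best.chaos, best.coherence)
```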
Even if I download it here, I get the same results. Here it is with the verbose flag:
The online service at https://charsetnormalizerweb-ousret.vercel.app/ also detects it as cp1256. The zipped file can be downloaded here: https://tmp.cihar.com/file.zip
After looking deeper into it, the problem is probably that
The issue seems to be that the "multi-byte bad cutting detector and adjustment" only fixes errors at the beginning of the chunk, but not at the end.
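To see why an end-of-chunk cut matters, here is a standalone illustration (not the library's actual code): slicing a UTF-8 payload at an arbitrary byte offset can land inside a multi-byte sequence, and a cut at the end of a chunk corrupts it just as much as one at the beginning.

```python
payload = "عربي".encode("utf-8")  # four Arabic letters, two bytes each

# This chunk starts on a character boundary but ends mid-character.
chunk = payload[0:5]  # two full characters plus the first byte of a third
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # "unexpected end of data": the tail of the chunk is broken
```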
This avoids issues with detecting string boundaries while improving performance (avoids multiple decoding of the sequence). Fixes jawah#174
My bad, the file was accidentally modified during in-flight download (on my side). This file seems to be particularly challenging for a charset detector.

**Chunk extraction**

Here are the chunks extracted (some of them). First:
Fourth one:
Lastly:
The immediate thing that can be observed is that there isn't much to observe in it, language-wise.

**Mess detector**

The first pass immediately triggers the
And the language detection fails to find any suitable match. Here are the "words" that are considered too suspicious, and I have to agree with it.
So, now we have more material to assess what is going on.

Erratum: I can see that you've taken the time to find a solution; I will look at it.
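For anyone wanting to reproduce this kind of per-chunk analysis, the Python API exposes an `explain` flag that enables debug logging (a sketch; the exact log format varies across versions):

```python
from charset_normalizer import from_path

# explain=True logs, chunk by chunk, which mess-detector checks fire
# and why candidate encodings are discarded.
results = from_path("file.txt", explain=True)
best = results.best()
if best is not None:
    print(best.encoding, best.chaos, best.coherence)
```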
My PR is merely a workaround for short sequences (it reuses decoded_payload and splits that instead of decoding it again). I've also added a more real-world test file in the PR.
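As a rough sketch of that idea (illustrative names only, not the actual charset_normalizer internals): decode the payload once, then slice the resulting string, so that no chunk boundary can fall inside a multi-byte sequence.

```python
def iter_chunks(payload: bytes, encoding: str, chunk_size: int = 512):
    # Decoding once and slicing the *string* avoids both the repeated
    # decoding work and the mid-character cuts that byte slicing causes.
    decoded_payload = payload.decode(encoding)
    for i in range(0, len(decoded_payload), chunk_size):
        yield decoded_payload[i : i + chunk_size]
```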
* Re-use decoded buffer for short texts

  This avoids issues with detecting string boundaries while improving performance (avoids multiple decoding of the sequence). Fixes #174

* 🔖 Bump version to 2.1.0.dev0

* 🐛 Workaround a potential bug in Python isspace table

  Character bug discovered in Python: Zero Width No-Break Space, located in Arabic Presentation Forms-B (Unicode 1.1), is not acknowledged as a space.

Co-authored-by: TAHRI Ahmed R <[email protected]>
Co-authored-by: Ahmed TAHRI <[email protected]>
Describe the bug
File is detected as cp1256 while it is actually utf-8.
To Reproduce
file.txt (the file is anonymized for privacy reasons)
Expected behavior
utf-8 should be detected.
Logs
Desktop (please complete the following information):
Additional context
chardet works fine on this file:
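For completeness, a sketch of the comparison one might run (file name as in the report; output comments reflect the behaviour described in this issue):

```python
import chardet
from charset_normalizer import from_path

with open("file.txt", "rb") as fh:
    raw = fh.read()

print(chardet.detect(raw))                    # chardet reports utf-8
print(from_path("file.txt").best().encoding)  # cp1256 before the fix
```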