Alternate file encodings throw UnicodeDecodeError #102

sbywater · 2022-09-13T17:29:07Z

Describe the bug

Python files that declare an alternate encoding throw a UnicodeDecodeError:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position n: invalid continuation byte

To Reproduce

Steps to reproduce the behavior:

1. For a file under consideration, add a declaration like # -*- coding: iso-8859-15 -*-
2. Add a line such as my_string = 'é'
3. Run deptry

Expected behavior

These files should be parsed correctly.

fpgmaas · 2022-09-13T17:44:32Z

Thanks for raising the issue. I have not worked with files with alternate encodings before, but I will have a look and see if I can reproduce and fix this tomorrow!

fpgmaas · 2022-09-13T18:12:17Z

Strange, I just tried to reproduce it but was not able to.

I added the following file and ran deptry:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
my_string = """
Ax	NBSP	¡	¢	£	€	¥	Š	§	š	©	ª	«	¬	SHY	®	¯
Bx	°	±	²	³	Ž	µ	¶	·	ž	¹	º	»	Œ	œ	Ÿ	¿
Cx	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
Dx	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ex	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Fx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ
"""
import foo

deptry successfully parsed this file and concluded that foo was a missing dependency.

So I think there is also a system-specific issue here? Maybe, to avoid this error on all systems, we need to detect and explicitly specify the encoding while reading, like this:

open('filename', encoding="ISO-8859-1")

But then we would need to detect the file encoding first.

Anyway, I do not have a lot of knowledge about encodings, so this might take me some time. Would also be good if I can find a way to reproduce this on my laptop. I will dive deeper into this issue tomorrow.
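
On the "detect the file encoding first" point: for Python source files, the standard library can already read the declared encoding. Below is a minimal sketch, assuming the only goal is to honour a PEP 263 coding declaration (or a UTF-8 BOM) and otherwise fall back to UTF-8. The helper names read_python_source and declared_encoding are hypothetical and not part of deptry.

```python
import tokenize
from pathlib import Path


def read_python_source(path: Path) -> str:
    """Hypothetical helper: decode a .py file using its declared encoding."""
    # tokenize.open() looks for a UTF-8 BOM or a PEP 263 coding declaration
    # in the first two lines and falls back to UTF-8 when neither is present.
    with tokenize.open(path) as f:
        return f.read()


def declared_encoding(path: Path) -> str:
    """Hypothetical helper: return the encoding that tokenize detects."""
    with open(path, "rb") as f:
        encoding, _first_lines = tokenize.detect_encoding(f.readline)
    return encoding
```

Because this reads the declaration from the file itself, the result should not depend on the operating system's default encoding. It does not help for files whose bytes disagree with their declaration.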

sbywater · 2022-09-13T19:16:14Z

Maybe you are using Windows, where ISO-8859-1 can be an assumed encoding?

System:

OS: Ubuntu 22.06
Language Version: Python 3.10

fpgmaas · 2022-09-14T07:11:56Z

I'm using macOS 12.3.1 and Python 3.9.

I think the issue should now be solved in release 0.4.6. From this version, deptry tries to identify the file-encoding before reading it using chardet, see here in the code and the corresponding unit tests. Please let me know if this resolves your issue.

sbywater · 2022-09-14T15:42:57Z

I've updated to 0.4.6. The new error is UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>

Here is the stack trace:

  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>

fpgmaas · 2022-09-14T16:35:27Z

Sorry that the implemented solution did not solve your problem.

It seems that chardet identifies an incorrect encoding for the file. I guess the only possible solution left is to catch this error and log a warning to the user that the specific file will be omitted while scanning for imports, since AFAIK there is no other way to identify the encoding.

fpgmaas · 2022-09-14T16:57:11Z

@sbywater Would it be possible for you to create a reproducible example? I still cannot reproduce the error. I am thinking of implementing the following:

1. Simply parse the file.
2. If that raises UnicodeDecodeError: guess the encoding, then parse the file again.
3. If it still raises UnicodeDecodeError, skip the file.

That would look as follows:

    def _get_imported_modules_from_py(self, path_to_py_file: Path) -> List[str]:
        # First attempt: open the file with the default encoding.
        try:
            with open(path_to_py_file) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            # Second attempt: guess the encoding and try again.
            return self._get_imported_modules_from_py_and_guess_encoding(path_to_py_file)

    def _get_imported_modules_from_py_and_guess_encoding(self, path_to_py_file: Path) -> List[str]:
        try:
            with open(path_to_py_file, encoding=self._get_file_encoding(path_to_py_file)) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            # Still undecodable: warn and skip this file.
            logging.warning(f"Warning: File {path_to_py_file} could not be decoded. Skipping...")
            return []

But I fail to write a unit test without being able to reproduce the error first.
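
One possible way to exercise the "skip the file" branch without a real-world problem file is to make the encoding guess injectable, so a test can force a wrong guess. The following is only a sketch of that idea, not deptry's actual code: parse_with_fallback and the guess_encoding parameter are hypothetical names, and the first attempt here assumes UTF-8 explicitly rather than the platform default.

```python
import ast
import logging
from pathlib import Path
from typing import Callable, Optional


def parse_with_fallback(
    path: Path, guess_encoding: Callable[[Path], Optional[str]]
) -> Optional[ast.AST]:
    """Hypothetical sketch of the three steps: parse, guess and retry, skip."""
    # Step 1: try a plain UTF-8 read (assumption: UTF-8 is the common case).
    try:
        with open(path, encoding="utf-8") as f:
            return ast.parse(f.read(), str(path))
    except UnicodeDecodeError:
        pass
    # Step 2: retry with whatever encoding the injected guesser returns.
    try:
        with open(path, encoding=guess_encoding(path)) as f:
            return ast.parse(f.read(), str(path))
    except UnicodeDecodeError:
        # Step 3: give up on this file and let the caller skip it.
        logging.warning("File %s could not be decoded. Skipping...", path)
        return None
```

A test can then write a file containing the byte 0xff, which is never valid UTF-8, and pass a guesser that returns "ascii" to reach step 3, or one that returns "latin-1" to check that step 2 succeeds.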

sbywater · 2022-09-14T21:17:22Z

I can clarify now: the original problem file no longer throws an error. However, under 0.4.6 a file that worked before now throws the UnicodeDecodeError. The problem file does not declare a file encoding, and includes this code:

my_string = '🐺'

Let me know if you'd like me to create a new issue for this. Your proposed patch looks like a good solution.

Here is a verbose stack trace...

EUC-JP Japanese prober hit error at byte 374
EUC-KR Korean prober hit error at byte 374
CP949 Korean prober hit error at byte 374
Big5 Chinese prober hit error at byte 375
EUC-TW Taiwan prober hit error at byte 374
utf-8 not active
SHIFT_JIS Japanese confidence = 0.01
EUC-JP not active
GB2312 Chinese confidence = 0.01
EUC-KR not active
CP949 not active
Big5 not active
EUC-TW not active
Johab Korean confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/foo/bin/deptry", line 8, in <module>
    sys.exit(deptry())
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/cli.py", line 198, in deptry
    ).run()
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/core.py", line 61, in run
    imported_modules = ImportParser().get_imported_modules_for_list_of_files(all_python_files)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in get_imported_modules_for_list_of_files
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in <listcomp>
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 375: character maps to <undefined>
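
The byte 0x90 in this error can be traced back to the emoji itself. The traceback shows the cp1254 codec in use, which is consistent with the Turkish guess in the log above. The small sketch below checks which single-byte code pages can decode the UTF-8 bytes of the two test strings from this thread; the helper name decodes is made up for illustration.

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: report whether data decodes without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


wolf_bytes = "🐺".encode("utf-8")  # b'\xf0\x9f\x90\xba', note the 0x90
e_acute_bytes = "é".encode("utf-8")  # b'\xc3\xa9'

# cp1254 has no character assigned to 0x90, so the emoji cannot be decoded,
# while both bytes of 'é' are assigned and decode (to the wrong characters).
wolf_ok_cp1254 = decodes(wolf_bytes, "cp1254")
e_acute_ok_cp1254 = decodes(e_acute_bytes, "cp1254")
```

So a wrong single-byte guess can silently "work" for 'é' and still fail for '🐺', which would explain why only some files trigger the error.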

fpgmaas · 2022-09-15T06:31:42Z

Weirdly enough, a file with the line my_string = '🐺' is also parsed correctly on my machine. So it is still not possible for me to reproduce the error.

I have decided to release 0.4.7 with the code snippet from my comment above anyway. With what we now know, always using chardet seems like a bad idea, since most files will be UTF-8. The only problem is that I cannot test whether it resolves your issue, since I am not able to write a unit test that throws a UnicodeDecodeError after trying both the default UTF-8 and chardet.

Could you try with 0.4.7?

fpgmaas · 2022-09-15T19:03:43Z

Added a PR with a unit test for the warning that is logged when a file has encoding issues: #106

fpgmaas · 2022-09-20T07:59:27Z

I believe this is fixed with the aforementioned PR

sbywater · 2022-09-20T14:18:33Z

I agree that this is now fixed.

wyattscarpenter · 2024-02-21T19:34:58Z

I'm getting this same Unicode emoji issue on deptry 0.12.0, Windows 10, Python 3.11.0: a file with just my_string = '🐺' in it does not parse correctly, and instead I get this warning: Warning: File the_wolf.py could not be decoded. Skipping.... Adding a # coding = utf-8 line at the top does not help (my_string = 'é' is fine, no problem). The file is encoded as UTF-8. Changing the encoding to UTF-8 with BOM, thus adding the UTF-8-encoded BOM, allows deptry to read the file just fine. Also, I happen to have WSL installed, and deptry 0.12.0 reads the file just fine when I run it through WSL, so I assume it's Python's default encoding assumption on Windows that is causing this problem to emerge.
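
For reference, when open() is called in text mode without an encoding argument, Python uses a platform-dependent default, which on Windows is often a legacy code page rather than UTF-8. One way to sidestep that, sketched below under the assumption that only top-level imports are needed, is to pass the raw bytes to ast.parse, which then applies the usual source-file rules (BOM, PEP 263 declaration, UTF-8 default). The function name imported_top_level_modules is hypothetical and this is not deptry's implementation.

```python
import ast
from pathlib import Path
from typing import List


def imported_top_level_modules(path: Path) -> List[str]:
    """Hypothetical sketch: list imported top-level module names in a file."""
    # Passing bytes lets the compiler pick the source encoding itself,
    # independent of the platform's default text encoding.
    tree = ast.parse(path.read_bytes(), filename=str(path))
    modules: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 excludes relative imports such as "from . import x".
            modules.append(node.module.split(".")[0])
    return modules
```

With this approach a UTF-8 file containing my_string = '🐺' should parse the same way on Windows, macOS, and Linux, with or without a BOM.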

sbywater added the bug Something isn't working label Sep 13, 2022

fpgmaas linked a pull request Sep 14, 2022 that will close this issue

detect file encoding with chardet before parsing the .py file #103

Merged

4 tasks

fpgmaas closed this as completed in #103 Sep 14, 2022

fpgmaas reopened this Sep 14, 2022

fpgmaas mentioned this issue Sep 15, 2022

changed parsing logic; only try to get encoding if initial parsing fails #105

Merged

4 tasks

fpgmaas closed this as completed Sep 20, 2022
