Alternate file encodings throw UnicodeDecodeError #102

Closed
sbywater opened this issue Sep 13, 2022 · 13 comments · Fixed by #103
Labels
bug Something isn't working

Comments

@sbywater

Describe the bug

Python files that declare an alternate encoding throw a UnicodeDecodeError:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position n: invalid continuation byte

To Reproduce

Steps to reproduce the behavior:

  1. For a file under consideration, add a declaration like # -*- coding: iso-8859-15 -*-
  2. add a line such as my_string = 'é'
  3. Run deptry
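The steps above can be condensed into a minimal script. This is a sketch under the assumption that the failure comes from decoding the file as UTF-8, as the error message suggests; the file name and helper functions are hypothetical and not part of deptry.

```python
import tempfile
from pathlib import Path

# Source text that declares a Latin-9 encoding and contains a non-ASCII character.
SOURCE = "# -*- coding: iso-8859-15 -*-\nmy_string = 'é'\nimport foo\n"


def write_latin9_file(directory: Path) -> Path:
    """Write SOURCE encoded as ISO-8859-15, so 'é' is stored as the single byte 0xE9."""
    path = directory / "latin9_example.py"  # hypothetical file name
    path.write_bytes(SOURCE.encode("iso-8859-15"))
    return path


def try_read_as_utf8(path: Path) -> str:
    """Return 'ok' if the file decodes as UTF-8, otherwise the decode error message."""
    try:
        path.read_text(encoding="utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)


with tempfile.TemporaryDirectory() as tmp:
    result = try_read_as_utf8(write_latin9_file(Path(tmp)))

print(result)
```

Reading the same file with `encoding="iso-8859-15"` succeeds, which is why the declared encoding matters.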

Expected behavior

These files should be parsed correctly.

@sbywater sbywater added the bug Something isn't working label Sep 13, 2022
@fpgmaas
Owner

fpgmaas commented Sep 13, 2022

Thanks for raising the issue. I have not worked with files with alternate encodings before, so I will have a look and see if I can reproduce and fix this tomorrow!

@fpgmaas
Owner

fpgmaas commented Sep 13, 2022

Strange, I just tried to reproduce it but was not able to.

I added the following file and ran deptry:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
my_string = """
Ax	NBSP	¡	¢	£	€	¥	Š	§	š	©	ª	«	¬	SHY	®	¯
Bx	°	±	²	³	Ž	µ	¶	·	ž	¹	º	»	Œ	œ	Ÿ	¿
Cx	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
Dx	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ex	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Fx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ
"""
import foo

deptry successfully parsed this file and concluded that foo was a missing dependency.

So I think there is also a system-specific issue here? Maybe to avoid this error on all systems, we need to detect and explicitly specify the encoding while reading, like this:

open('filename', encoding="ISO-8859-1")

But then we would need to detect the file encoding first.
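For files that declare their encoding, the standard library can read that declaration directly. The following is an alternative sketch, not deptry's implementation: `tokenize.detect_encoding` reads a PEP 263 coding declaration (or a UTF-8 BOM) and falls back to UTF-8, and `ast.parse` accepts raw bytes and applies the declaration itself. It does not help with files that use another encoding without declaring it.

```python
import ast
import io
import tokenize

# Bytes of a source file that declares ISO-8859-15 and is actually encoded that way.
raw = "# -*- coding: iso-8859-15 -*-\nmy_string = 'é'\nimport foo\n".encode("iso-8859-15")

# detect_encoding takes a readline callable over the raw bytes and returns
# the encoding name plus the lines it had to read to find it.
encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
decoded = raw.decode(encoding)

# Passing bytes to ast.parse lets the parser honour the declaration on its own.
tree = ast.parse(raw)
imported = [
    alias.name
    for node in ast.walk(tree)
    if isinstance(node, ast.Import)
    for alias in node.names
]

print(encoding, imported)
```

When reading from disk, `tokenize.open(path)` wraps the same detection and returns a text file object opened with the detected encoding.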

Anyway, I do not have a lot of knowledge about encodings, so this might take me some time. Would also be good if I can find a way to reproduce this on my laptop. I will dive deeper into this issue tomorrow.

@sbywater
Author

Maybe you are using Windows, where ISO-8859-1 can be an assumed encoding?

System:

OS: Ubuntu 22.06
Language Version: Python 3.10

@fpgmaas fpgmaas linked a pull request Sep 14, 2022 that will close this issue
@fpgmaas
Owner

fpgmaas commented Sep 14, 2022

I'm using macOS 12.3.1 and Python 3.9.

I think the issue should now be solved in release 0.4.6. From this version, deptry tries to identify the file-encoding before reading it using chardet, see here in the code and the corresponding unit tests. Please let me know if this resolves your issue.

@fpgmaas fpgmaas reopened this Sep 14, 2022
@sbywater
Author

I've updated to 0.4.6. New error is UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>

Here is the stack trace:

  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>

@fpgmaas
Owner

fpgmaas commented Sep 14, 2022

Sorry that the implemented solution did not solve your problem.

It seems that chardet identifies an incorrect encoding for the file. I guess the only possible solution left is to catch this error and log a warning to the user that the specific file will be omitted while scanning for imports, since AFAIK there is no other way to identify the encoding.

@fpgmaas
Owner

fpgmaas commented Sep 14, 2022

@sbywater Would it be possible for you to create a reproducible example? I still cannot reproduce the error. I am thinking of implementing the following:

  • Simply parse the file.
  • If that raises UnicodeDecodeError, guess the encoding, then parse the file again.
  • If that still raises UnicodeDecodeError, skip the file.

Which would look as follows.

    def _get_imported_modules_from_py(self, path_to_py_file: Path) -> List[str]:
        try:
            with open(path_to_py_file) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            return self._get_imported_modules_from_py_and_guess_encoding(path_to_py_file)

    def _get_imported_modules_from_py_and_guess_encoding(self, path_to_py_file: Path) -> List[str]:
        try:
            with open(path_to_py_file, encoding=self._get_file_encoding(path_to_py_file)) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            logging.warning(f"Warning: File {path_to_py_file} could not be decoded. Skipping...")
            return []

But I cannot write a unit test without first being able to reproduce the error.

@sbywater
Author

sbywater commented Sep 14, 2022

I can clarify now: the original problem file no longer throws an error. However, under 0.4.6 a file that worked before now throws the UnicodeDecodeError. The problem file does not declare a file encoding, and includes this code:

my_string = '🐺'

Let me know if you'd like me to create a new issue for this. Your proposed patch looks like a good solution.
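The byte in the error message can be traced back to the emoji itself. As a small sketch (the helper name is hypothetical): the UTF-8 encoding of '🐺' is F0 9F 90 BA, and the traceback shows the cp1254 codec being used, which has no character assigned to byte 0x90.

```python
# UTF-8 bytes of the line from the problem file.
wolf_bytes = "my_string = '🐺'\n".encode("utf-8")


def describe_decode(data: bytes, encoding: str) -> str:
    """Return 'ok' if data decodes with the given codec, else where and why it failed."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode.
        return f"byte {data[exc.start]:#04x} at position {exc.start}: {exc.reason}"


print(describe_decode(wolf_bytes, "utf-8"))
print(describe_decode(wolf_bytes, "cp1254"))
```

So a file that is valid UTF-8 can still fail once a detector guesses a single-byte Windows code page for it.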

Here is a verbose stack trace...

EUC-JP Japanese prober hit error at byte 374
EUC-KR Korean prober hit error at byte 374
CP949 Korean prober hit error at byte 374
Big5 Chinese prober hit error at byte 375
EUC-TW Taiwan prober hit error at byte 374
utf-8 not active
SHIFT_JIS Japanese confidence = 0.01
EUC-JP not active
GB2312 Chinese confidence = 0.01
EUC-KR not active
CP949 not active
Big5 not active
EUC-TW not active
Johab Korean confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/foo/bin/deptry", line 8, in <module>
    sys.exit(deptry())
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/cli.py", line 198, in deptry
    ).run()
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/core.py", line 61, in run
    imported_modules = ImportParser().get_imported_modules_for_list_of_files(all_python_files)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in get_imported_modules_for_list_of_files
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in <listcomp>
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 375: character maps to <undefined>

@fpgmaas
Owner

fpgmaas commented Sep 15, 2022

Weirdly enough, a file with the line my_string = '🐺' is also parsed correctly on my machine. So it is still not possible for me to reproduce the error.

I have decided to release 0.4.7 with the snippet from my comment above anyway. With what we now know, always using chardet seems like a bad idea, since most files will be UTF-8. The only problem is that I cannot test whether it resolves your issue, because I have not been able to write a unit test that throws a UnicodeDecodeError after trying both the default UTF-8 read and chardet.

Could you try with 0.4.7?

@fpgmaas
Owner

fpgmaas commented Sep 15, 2022

Added a PR with a unit test for the warning logging when a file has encoding-issues: #106

@fpgmaas
Owner

fpgmaas commented Sep 20, 2022

I believe this is fixed by the aforementioned PR.

@fpgmaas fpgmaas closed this as completed Sep 20, 2022
@sbywater
Author

I agree that this is now fixed.

@wyattscarpenter

wyattscarpenter commented Feb 21, 2024

I'm getting this same Unicode emoji issue on deptry 0.12.0, Windows 10, Python 3.11.0. A file containing just my_string = '🐺' does not parse correctly; instead I get this: Warning: File the_wolf.py could not be decoded. Skipping.... Adding a # coding = utf-8 line at the top does not help. (my_string = 'é' is fine, no problem.) The file is encoded as UTF-8. Changing the encoding to UTF-8 with BOM, which adds the UTF-8-encoded BOM, allows deptry to read the file just fine. I also happen to have WSL installed, and deptry 0.12.0 reads the file just fine when I run it through WSL, so I assume the problem comes from Python's default encoding assumption on Windows.
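The difference between the two test strings can be illustrated with explicit encodings. This sketch assumes the Windows default is cp1252, which is common for Western locales but was not confirmed for this machine; `accent.py` and the `read_with` helper are hypothetical names.

```python
import locale
import tempfile
from pathlib import Path
from typing import Optional


def read_with(path: Path, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the codec rejects the file."""
    try:
        return path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    wolf = Path(tmp) / "the_wolf.py"
    wolf.write_bytes("my_string = '🐺'\n".encode("utf-8"))
    accent = Path(tmp) / "accent.py"  # hypothetical second file
    accent.write_bytes("my_string = 'é'\n".encode("utf-8"))

    results = {
        # cp1252 has no character for byte 0x90, so the emoji file is rejected.
        "wolf/cp1252": read_with(wolf, "cp1252"),
        # The two UTF-8 bytes of 'é' are both valid cp1252, so this decodes silently.
        "accent/cp1252": read_with(accent, "cp1252"),
        # An explicit UTF-8 read behaves the same on every platform.
        "wolf/utf-8": read_with(wolf, "utf-8"),
    }

# The encoding open() uses when none is given (outside UTF-8 mode).
print(locale.getpreferredencoding(False))
print(results)
```

Under that assumption, 'é' passes only because its bytes happen to be valid in the code page (it is decoded as two wrong characters), while the emoji contains a byte the code page cannot map.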
