fix #82: correctly handle filenames reported by git.ls_files() when they contain unicode escapes #83

cgrothaus
Summary

This PR addresses issue #82: the get_tracked_files method of the Coder class mishandled filenames containing non-ASCII characters. By default, git.ls_files() reports such names enclosed in double quotes, with each non-ASCII byte written as a backslash-escaped octal sequence. The previous implementation did not decode these escape sequences, leading to FileNotFoundError when trying to open these files.

Changes

The change is in the get_tracked_files method. When a filename is enclosed in double quotes, indicating it contains special characters, we now strip those quotes and decode the filename in two steps: 'unicode_escape' (wrapped in a 'latin1' round trip) turns the escape sequences back into the raw bytes of the name, and 'utf-8' then decodes those bytes into the corresponding non-ASCII characters.

Here's the updated get_tracked_files method:

from pathlib import Path, PurePosixPath

def get_tracked_files(self):
    if not self.repo:
        return []
    files = set(self.repo.git.ls_files().splitlines())
    # git encloses paths with special characters in double quotes and
    # backslash-escapes them; decode those names back to text
    res = set()
    for path in files:
        # the length check keeps a lone '"' from being treated as a quoted path
        if len(path) >= 2 and path.startswith('"') and path.endswith('"'):
            # Strip quotes, convert escape sequences to characters,
            # convert to bytes and decode as UTF-8.
            path = (path[1:-1]
                        .encode('latin1')
                        .decode('unicode_escape')
                        .encode('latin1')
                        .decode('utf-8')
                    )
        # convert to appropriate os.sep, since git always normalizes to /
        res.add(str(Path(PurePosixPath(path))))

    return res
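
To make the decoding chain easier to follow in isolation, here is a minimal standalone sketch of the same steps. The helper name unquote_git_path and the sample filename are assumptions for illustration and are not part of aider; the sketch also assumes the filename bytes are UTF-8.

```python
def unquote_git_path(path: str) -> str:
    """Decode a path as git prints it when quoting is on (hypothetical helper)."""
    if len(path) < 2 or not (path.startswith('"') and path.endswith('"')):
        return path  # unquoted paths are already plain text
    escaped = path[1:-1]
    # Interpret the backslash escapes; each escaped byte becomes one
    # character with a code point in the range 0-255.
    one_char_per_byte = escaped.encode("latin1").decode("unicode_escape")
    # Map those characters back to raw bytes, then decode the bytes as UTF-8.
    return one_char_per_byte.encode("latin1").decode("utf-8")


# Assumed sample: "spät.txt" as git would report it (ä is the bytes 0xC3 0xA4).
quoted = '"sp\\303\\244t.txt"'
print(unquote_git_path(quoted))  # prints: spät.txt
```

Unquoted names pass through unchanged, so the helper can be applied to every line of the ls_files output.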

Testing

The changes have been tested with filenames containing non-ASCII characters on the https://github.com/cgrothaus/sample-repo-demonstrate-aider-bug-special-filenames repo, and the get_tracked_files method now correctly decodes these filenames. As a result, the FileNotFoundError no longer occurs when trying to open these files.

The changes have been tested on macOS with Python 3.10.11. I did not test them on Windows.
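
Since Windows was not tested, the separator conversion can at least be approximated on any platform with pathlib's pure path classes. This is only a sketch: the sample path is an assumption, and PureWindowsPath stands in for what Path would be on Windows, so it does not replace a real test on that platform.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Assumed sample: a path as git reports it, already decoded, with "/" separators.
git_path = "docs/spät.txt"

# PureWindowsPath applies Windows path rules without needing a Windows machine.
windows_style = str(PureWindowsPath(PurePosixPath(git_path)))
posix_style = str(PurePosixPath(git_path))

print(windows_style)  # prints: docs\spät.txt
print(posix_style)    # prints: docs/spät.txt
```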

Impact

This fix improves the robustness of aider when dealing with repositories containing files with non-ASCII characters in their names. It should not affect the functionality of aider in other aspects.

cgrothaus commented Jul 11, 2023

Note: I have close to no knowledge of Python. This PR was mainly produced by LLMs, with me prompting either ChatGPT via aider or phind.com.
