Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot format non-ascii strings without unicode escape #4

Closed
ZeroRin opened this issue Dec 17, 2022 · 11 comments
Closed

Cannot format non-ascii strings without unicode escape #4

ZeroRin opened this issue Dec 17, 2022 · 11 comments

Comments

@ZeroRin
Copy link
Contributor

ZeroRin commented Dec 17, 2022

compact json uses json.dumps(element) without any configuration

simple_node.value = json.dumps(element)

so that there is no way to set ensure_ascii=False and the output will always be with unicode escaped characters, like "\u5f20\u4e09" instead of "张三"

I think there should be a option to control this behaviour to make the output more human friendly.

@masaccio
Copy link
Owner

Would it be safe do you think to always use ensure_ascii=False? I am experimenting with replacing len(s) for the string value with:

def string_length(self, s):
    if self.east_asian_string_widths:
        length = sum([char_display_width(c) for c in s])
        return length
    else:
        
return len(s)

where string_length is taken from this:

# From https://github.com/ncm2/ncm2
# Copyright © 2018 [email protected]
def char_display_width(unicode_str):
    r = east_asian_width(unicode_str)
    if r == "F":  # Fullwidth
        return 1
    elif r == "H":  # Half-width
        return 1
    elif r == "W":  # Wide
        return 2
    elif r == "Na":  # Narrow
        return 1
    elif r == "A":  # Ambiguous, go with 2
        return 1
    elif r == "N":  # Neutral
        return 1
    else:
        return 1

I could be safe and test east_asian_string_widths to decide how to apply ensure_ascii. Doing this I get nicely formatted output like this:

[
    { "name": "Alice" , "age": "17"    , "occupation": "student"  },
    { "name": "Angela", "age": "42"    , "occupation": "Anwältin" },
    { "name": "张三"  , "age": "十七"  , "occupation": "学生"     },
    { "name": "依诺成", "age": "三十五", "occupation": "工程师"   }
]

This renders better for me in a terminal than on github.com

@masaccio
Copy link
Owner

@masaccio
Copy link
Owner

I've published a different change to the one in your pull request that uses a built-in for the character width. I don't look at cells.py much so I may have missed a load of corner-cases.

@ZeroRin
Copy link
Contributor Author

ZeroRin commented Dec 18, 2022

Oh I didn't know that there is a built-in to do the calculation.
Different systems, fonts, and even consoles may treat same character differently so there can always be corner cases. As long as it's correct in most of the cases it should be OK. For special senarios users can override the default string_length anyway

Would it be safe do you think to always use ensure_ascii=False?

I think the answer is yes. The output with ensure_ascii=False should always be a valid json string (which can be put back to json.loads(string) and get the same object back). There could be situation where the downstream tasks only allow pure-ascii strings, but if a user set Formatter.east_asian_string_widths=True he should be clear about this

@ZeroRin
Copy link
Contributor Author

ZeroRin commented Dec 19, 2022

just tested the char_display_width function with the cell_len I got, the difference are as follows:

  • Control characters and combining characters which takes 0 width are assigned with value 1. It seems difficult to determine which character is in this category with functions available in unicodedata.
  • Fullwidth characters (with r == "F", like a,b,c) should take 2 characters instead of 1. Change the return value to 2 would solve the problem. Notice that there are characters with r == "F" and takes 1 character width, but they are all undefined characters with a default display character 﫢. Such characters should be handled by the upstream system in my opinion, but if you want handle them correctly you may use:
    if r == "F":  # Fullwidth
        if unicodedata.name(unicode_str, False):
            return 2
        return 1

@masaccio
Copy link
Owner

Do you have some example text that I can include in a test for this? Current test is https://github.com/masaccio/compact-json/blob/main/tests/data/test-issue-4.json

@ZeroRin
Copy link
Contributor Author

ZeroRin commented Dec 19, 2022

I just created a example with country names, with original data and what I got with cell_len

The output is perfectly aligned at least in terminal:
image

@masaccio
Copy link
Owner

masaccio commented Dec 19, 2022

After looking into some corner cases where the character size is zero, I discovered https://github.com/jquast/wcwidth which seems to do what we need out-of-the-box.

The results for me are still not perfectly aligned though which is frustrating!

I've added assertions that indicate that the two calculations are identical:

elem.name_length = wcswidth(elem.name)
assert get_string_size(elem.name) == elem.name_length

But in my terminal at least the results are still not aligned. What do you see? tests/data/test-issue-4.ref-1.json doesn't align in my terminal.

@masaccio
Copy link
Owner

Screenshot 2022-12-19 at 19 49 51

@ZeroRin
Copy link
Contributor Author

ZeroRin commented Dec 20, 2022

The result with wcswidth is identical to the cell_len I use (the rich document said it is derived from wcwidth anyway)

And that's where things get complicated. For those rare characters, the display for different language could be very different on the environment. On the PC I'm currently using, test-issue-4.ref-1.json is only perfectly aligned in vscode integrated terminal. It looks different in editor and in external Window Terminal. With another computer I have it is not even aligned in vscode terminal.

@masaccio
Copy link
Owner

I will back out the cell_len implementation then and just use wcswidth. I guess it's very sensitive to the font in use so I am going to make the bold assertion that people reading such languages will use a font that fully supports its spacing. My en_GB Terminal.app doesn't do that it seems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants