Cannot format non-ascii strings without unicode escape #4

ZeroRin · 2022-12-17T02:28:34Z

compact json uses json.dumps(element) without any configuration

compact-json/src/compact_json/formatter.py

Line 270 in 3b96495

simple_node.value = json.dumps(element)

so that there is no way to set ensure_ascii=False and the output will always be with unicode escaped characters, like "\u5f20\u4e09" instead of "张三"

I think there should be a option to control this behaviour to make the output more human friendly.

The text was updated successfully, but these errors were encountered:

masaccio · 2022-12-18T14:24:58Z

Would it be safe do you think to always use ensure_ascii=False? I am experimenting with replacing len(s) for the string value with:

def string_length(self, s):
    if self.east_asian_string_widths:
        length = sum([char_display_width(c) for c in s])
        return length
    else:
        
return len(s)

where string_length is taken from this:

# From https://github.com/ncm2/ncm2
# Copyright © 2018 [email protected]
def char_display_width(unicode_str):
    r = east_asian_width(unicode_str)
    if r == "F":  # Fullwidth
        return 1
    elif r == "H":  # Half-width
        return 1
    elif r == "W":  # Wide
        return 2
    elif r == "Na":  # Narrow
        return 1
    elif r == "A":  # Ambiguous, go with 2
        return 1
    elif r == "N":  # Neutral
        return 1
    else:
        return 1

I could be safe and test east_asian_string_widths to decide how to apply ensure_ascii. Doing this I get nicely formatted output like this:

[
    { "name": "Alice" , "age": "17"    , "occupation": "student"  },
    { "name": "Angela", "age": "42"    , "occupation": "Anwältin" },
    { "name": "张三"  , "age": "十七"  , "occupation": "学生"     },
    { "name": "依诺成", "age": "三十五", "occupation": "工程师"   }
]

This renders better for me in a terminal than on github.com

masaccio · 2022-12-18T14:29:54Z

See https://www.dropbox.com/s/qdihbgg6lhe5s8q/compact-json-issue-4.png?dl=0

masaccio · 2022-12-18T14:47:38Z

I've published a different change to the one in your pull request that uses a built-in for the character width. I don't look at cells.py much so I may have missed a load of corner-cases.

ZeroRin · 2022-12-18T15:39:00Z

Oh I didn't know that there is a built-in to do the calculation.
Different systems, fonts, and even consoles may treat same character differently so there can always be corner cases. As long as it's correct in most of the cases it should be OK. For special senarios users can override the default string_length anyway

Would it be safe do you think to always use ensure_ascii=False?

I think the answer is yes. The output with ensure_ascii=False should always be a valid json string (which can be put back to json.loads(string) and get the same object back). There could be situation where the downstream tasks only allow pure-ascii strings, but if a user set Formatter.east_asian_string_widths=True he should be clear about this

ZeroRin · 2022-12-19T03:50:31Z

just tested the char_display_width function with the cell_len I got, the difference are as follows:

Control characters and combining characters which takes 0 width are assigned with value 1. It seems difficult to determine which character is in this category with functions available in unicodedata.
Fullwidth characters (with r == "F", like ａ,ｂ,ｃ) should take 2 characters instead of 1. Change the return value to 2 would solve the problem. Notice that there are characters with r == "F" and takes 1 character width, but they are all undefined characters with a default display character 﫢. Such characters should be handled by the upstream system in my opinion, but if you want handle them correctly you may use:

    if r == "F":  # Fullwidth
        if unicodedata.name(unicode_str, False):
            return 2
        return 1

masaccio · 2022-12-19T09:04:17Z

Do you have some example text that I can include in a test for this? Current test is https://github.com/masaccio/compact-json/blob/main/tests/data/test-issue-4.json

ZeroRin · 2022-12-19T10:08:27Z

I just created a example with country names, with original data and what I got with cell_len

The output is perfectly aligned at least in terminal:

masaccio · 2022-12-19T19:23:04Z

After looking into some corner cases where the character size is zero, I discovered https://github.com/jquast/wcwidth which seems to do what we need out-of-the-box.

The results for me are still not perfectly aligned though which is frustrating!

I've added assertions that indicate that the two calculations are identical:

elem.name_length = wcswidth(elem.name)
assert get_string_size(elem.name) == elem.name_length

But in my terminal at least the results are still not aligned. What do you see? tests/data/test-issue-4.ref-1.json doesn't align in my terminal.

masaccio · 2022-12-19T19:50:31Z

ZeroRin · 2022-12-20T02:49:08Z

The result with wcswidth is identical to the cell_len I use (the rich document said it is derived from wcwidth anyway)

And that's where things get complicated. For those rare characters, the display for different language could be very different on the environment. On the PC I'm currently using, test-issue-4.ref-1.json is only perfectly aligned in vscode integrated terminal. It looks different in editor and in external Window Terminal. With another computer I have it is not even aligned in vscode terminal.

masaccio · 2022-12-20T09:19:48Z

I will back out the cell_len implementation then and just use wcswidth. I guess it's very sensitive to the font in use so I am going to make the bold assertion that people reading such languages will use a font that fully supports its spacing. My en_GB Terminal.app doesn't do that it seems.

masaccio closed this as completed Dec 20, 2022

ZeroRin mentioned this issue Jan 5, 2023

Let string splitters respect East_Asian_Width property psf/black#3445

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot format non-ascii strings without unicode escape #4

Cannot format non-ascii strings without unicode escape #4

ZeroRin commented Dec 17, 2022

masaccio commented Dec 18, 2022

masaccio commented Dec 18, 2022

masaccio commented Dec 18, 2022

ZeroRin commented Dec 18, 2022

ZeroRin commented Dec 19, 2022 •

edited

Loading

masaccio commented Dec 19, 2022

ZeroRin commented Dec 19, 2022 •

edited

Loading

masaccio commented Dec 19, 2022 •

edited

Loading

masaccio commented Dec 19, 2022

ZeroRin commented Dec 20, 2022

masaccio commented Dec 20, 2022

Cannot format non-ascii strings without unicode escape #4

Cannot format non-ascii strings without unicode escape #4

Comments

ZeroRin commented Dec 17, 2022

masaccio commented Dec 18, 2022

masaccio commented Dec 18, 2022

masaccio commented Dec 18, 2022

ZeroRin commented Dec 18, 2022

ZeroRin commented Dec 19, 2022 • edited Loading

masaccio commented Dec 19, 2022

ZeroRin commented Dec 19, 2022 • edited Loading

masaccio commented Dec 19, 2022 • edited Loading

masaccio commented Dec 19, 2022

ZeroRin commented Dec 20, 2022

masaccio commented Dec 20, 2022

ZeroRin commented Dec 19, 2022 •

edited

Loading

ZeroRin commented Dec 19, 2022 •

edited

Loading

masaccio commented Dec 19, 2022 •

edited

Loading