Suggestion: Unicode version codepoint was added #48

Open
wezm opened this issue Jun 17, 2020 · 5 comments

@wezm

wezm commented Jun 17, 2020

I deal with Unicode a fair bit and chars is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.

E.g. the output from chars might look something like this. The version information need not be shown by default; it could require a command-line flag if it were deemed too noisy.

$ chars party
U+0001F973, 🥳 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0

U+0001F389, 🎉 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0

I think the information is available via the DerivedAge.txt file in the UCD.

@antifuchs
Collaborator

This is a marvelous idea! Thanks for submitting it! :D

I'm not sure I can take a look at this in the next few weeks, but would love to have this feature. If you want to take a stab at it, I can probably give you enough guidance to get you started, though (:

@wezm
Author

wezm commented Jun 17, 2020

I might be able to take a look on the weekend. Did you have any preferences/thoughts regarding whether the version information should be output by default?

@antifuchs
Collaborator

I think showing the version unconditionally would be just fine - chars is somewhat aggressively non-configurable and maximally informative for human users, so just adding it would work well (:

To add this feature, I think it's a two/three step process:

  1. you'd add a task to fetch the data file to the chars_data subcrate in the chars workspace here,
  2. update write_name_data in the unicode portion to emit another table giving Unicode versions & the ranges added in them (ideally make it a memory-optimized data structure; I don't extremely mind searching through n*13ish Unicode versions for each character, but would be worried if we added a table mapping each character to a version number... maybe there's something one could do with tries though?)
  3. update the Codepoint Display impl's branch for Unicode here-ish to show the version number.
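
To make the "ranges instead of a per-character map" idea in step 2 concrete, here is a rough sketch of one possible layout: a sorted slice of inclusive ranges searched with a binary search. Everything named here (`AgeRange`, `AGE_TABLE`, `unicode_version`) is hypothetical and not taken from the chars codebase; the two table entries are just the two codepoints from the example output above, whereas a generated table would hold the ranges from the data file.

```rust
use std::cmp::Ordering;

/// One contiguous run of codepoints first assigned in the same Unicode version.
/// (Hypothetical type, not part of chars.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AgeRange {
    pub first: u32, // first codepoint in the run
    pub last: u32,  // last codepoint in the run, inclusive
    pub major: u8,
    pub minor: u8,
}

/// Illustrative excerpt only. A generated table would need to be sorted by
/// `first` and free of overlaps for the binary search below to be valid.
pub static AGE_TABLE: &[AgeRange] = &[
    AgeRange { first: 0x1F389, last: 0x1F389, major: 6, minor: 0 },
    AgeRange { first: 0x1F973, last: 0x1F973, major: 11, minor: 0 },
];

/// Look up the (major, minor) version for `c`, or `None` if no range covers it.
pub fn unicode_version(table: &[AgeRange], c: char) -> Option<(u8, u8)> {
    let cp = c as u32;
    table
        .binary_search_by(|r| {
            if r.last < cp {
                Ordering::Less // range lies entirely below the codepoint
            } else if r.first > cp {
                Ordering::Greater // range lies entirely above the codepoint
            } else {
                Ordering::Equal // codepoint falls inside this range
            }
        })
        .ok()
        .map(|i| (table[i].major, table[i].minor))
}
```

With this shape, `unicode_version(AGE_TABLE, '🥳')` would yield `Some((11, 0))`, and the table costs one entry per contiguous run rather than one per character. A trie might be smaller or faster still, but that is not explored here.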

...and that's about it, I think! The main difficulty will probably be making a parser for that data file (for the parsers I made, I got by with regex-based ones, but feel free to use any other reasonable method, tbqh) and finding a decently space-efficient repr for the version table. Best of luck!
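
As a sketch of the parsing part, assuming the usual UCD line shape of `first..last ; major.minor # comment` (with a bare `codepoint ; major.minor` for single characters), a std-only parser without regexes could look roughly like this. The function name `parse_age_line` is made up for illustration and is not an existing function in chars_data.

```rust
/// Parse one line in the `DerivedAge.txt` style.
/// Returns (first, last, major, minor), or `None` for blank lines,
/// comment-only lines, and anything that does not match the expected shape.
/// (Hypothetical helper, not part of chars.)
pub fn parse_age_line(line: &str) -> Option<(u32, u32, u8, u8)> {
    // Everything after '#' is a comment; keep only the data portion.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    // The data portion is "<range> ; <version>".
    let (range, version) = data.split_once(';')?;
    let range = range.trim();

    // A range is either "XXXX..YYYY" or a single codepoint "XXXX".
    let (first, last) = match range.split_once("..") {
        Some((a, b)) => (a, b),
        None => (range, range),
    };
    let first = u32::from_str_radix(first, 16).ok()?;
    let last = u32::from_str_radix(last, 16).ok()?;

    // The version is "major.minor".
    let (major, minor) = version.trim().split_once('.')?;
    Some((first, last, major.parse::<u8>().ok()?, minor.parse::<u8>().ok()?))
}
```

Feeding every line of the file through this and keeping the `Some` results would give the raw ranges; merging adjacent ranges with the same version before emitting the table would be a further space saving, if it turns out to matter.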

@wezm
Author

wezm commented Jun 21, 2020

I made a start on this yesterday. I'm 50–75% done. Fortunately I think what you described above matches what I did/planned to do 😃

@antifuchs
Collaborator

That's fantastic to hear - excited to see what you came up with (:
