Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Char property macro prototype 1 #41

Closed
wants to merge 6 commits into from
Closed

Char property macro prototype 1 #41

wants to merge 6 commits into from

Conversation

CAD97
Copy link
Collaborator

@CAD97 CAD97 commented Jul 19, 2017

This PR provides a macro to somewhat automate the implementation of Unicode Catalog and Enumeration property types (and uses it in ucd/bidi and ucd/category).

I haven't touched ucd/normal because I have yet to do the necessary reading about the composition/decomposition specification in Unicode that is a prerequisite to me trusting myself to make changes in it.

Open questions:

  • Should this be in unic-ucd-core? I'm starting to think so; every UCD crate will almost certainly want to use the macro if they expose a Catalog or Enumeration property. And since it's a macro, they just wouldn't #[macro_use] on their import (which they need for UnicodeVersion).
  • Should we provide a try_from_name(&str) -> Option<$name> (or similar) fn? To do so would entail adding the Long_Name form to the macro definition. (It's just an $ident and syntax bikeshedding away from being added.)
  • Any further macro syntax bikeshedding. I chose the current one for implementation because it's simple and doesn't run into the problem of wanting to have optional or reordered "members" of the variant (which my macro-fu is not good enough to support yet).
  • The mythical UnicodeProperty trait. It's a seperate issue to this macro, but related. Since Number properties exist, I'm thinking the only across-all-properties functionality we can promise is an of(ch: char) -> Self. Should we also have a UnicodePropertyWithAbbreviatedName (horrible example name) that exposes abbr_name(&self) and name(&self)? In any case, since they're all constructed the same way with the same methods they should probably be promised in a trait, even if they remain intrinsic methods.

Bidi_Class uses names with spaces and hyphens, meaning that the naive $ident match will not work. This also cuts down on chaff, because there are no longer two matches in the macro or repeated information (other than that in the docs).

The doc repetition I think is valid, as docs should be considered separately and just populating them with the display name would be redundant.
The display names come from Unicode Chapter 4 <http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf> Table 4-4. General Category
@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 19, 2017

-.- how did I miss my own tests...

@behnam
Copy link
Member

behnam commented Jul 19, 2017

This is cool! Thanks for working on it, @CAD97. It was fast!

I haven't touched ucd/normal because I have yet to do the necessary reading about the composition/decomposition specification in Unicode that is a prerequisite to me trusting myself to make changes in it.

I think there are a couple of helper functions in ucd/normal/ that I think belong to normal/. Just FYI. We need to fix that when we get back there.

Should this be in unic-ucd-core? I'm starting to think so; every UCD crate will almost certainly want to use the macro if they expose a Catalog or Enumeration property. And since it's a macro, they just wouldn't #[macro_use] on their import (which they need for UnicodeVersion).

I think this belongs to a non-ucd component, like unic::utils::char_property, because it has nothing to do with UCD data (and specially unicode-version, which is the base of ucd::core), and other components may want to use it without having anything to do with UCD. We don't have such a component yet, and IDNA, Emoji and other components with char-property still need some UCD access. But, it's still valid for external users.

So, I was thinking that any helper code, meaning not depending on unicode/i18n data and algorithm, should go under unic::utils.

Should we provide a try_from_name(&str) -> Option<$name> (or similar) fn?

I think yes. For now, mainly because we usually need these for conformance tests.

To do so would entail adding the Long_Name form to the macro definition. (It's just an $ident and syntax bikeshedding away from being added.)

I think there are a bunch of fields that need to be defined for each variant. That's why I initially suggested to have a named syntax, instead of an ordered syntax. And, having the names, many of the fields would be optional, and fall back to another one if not provided.

For example, there are property values with equal abbr and long names. If we want to provide one function to match both (which I think we should), we only need to list these once. Leaving long_name field empty in the variant definition would be a sign for this.

Also, maybe we can even auto-gen long names from variant names. There are a couple of small crates for this, and I think we can depend on them, or just implement/copy it from there under its own utils crate, in case we need a very small portion of it, or want to customize and improve.

Any further macro syntax bikeshedding. I chose the current one for implementation because it's simple and doesn't run into the problem of wanting to have optional or reordered "members" of the variant (which my macro-fu is not good enough to support yet).

I think optional fields would be a great thing to have here, because otherwise a lot of manual formatting/etc will be needed for each variant.

If we keep the diff limited to implementing the macro and an integration test for it, we can land it and work on it while the rest of the code is evolving, and switch whenever we feel it's ready.

The mythical UnicodeProperty trait. [...]

I want to work on these traits before we get the macro out, because, as you already mentioned, this sets the contract for the macro.

What I have in mind at the moment is one simple trait UnicodeProperty, which applies to all types of properties. And, then, have sub-traits like UnicodeEnumeratedProperty, which has to be enum and have lots of goodies. We can have long_name(), etc directly on UnicodeEnumeratedProperty until we find the sweet spot for a new trait that provide those things (and not be an enum).

Let me take a shot at this tomorrow and we talk on that PR.

WDYT?

@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 19, 2017

You take a shot and see what you can get. The annotated members is probably a better design for the macro; I just lack the macro knowledge required for the munching required to allow the flexibility it implies.

I agree that adding the formal UnicodeProperty trait at the same time as this probably would make sense; would the trait live in unic::utils as well? I do think that the macro for implementing the trait should be alongside the trait it implements.

I'll make the simple changes you've suggested for a fair comparison, but I can't wait to see what you come up with. Just keep this macro limitation in mind; I've required exactly one attribute before each enum variant because of it, but you could also choose to require a sigil (such as case (probably, untested) or a single symbol, such as @) before each variant.

@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 19, 2017

If it's possible to rewrite a bound $ident "BidiClass" to "Bidi_Class" (with macro_rules!), I can't find it; I don't think it's possible.

The reason for taking in the long name form is to allow using it as an &'static str rather than having to allocate space on the heap and transform a slightly different str into a correct String.

@behnam
Copy link
Member

behnam commented Jul 19, 2017

I agree that adding the formal UnicodeProperty trait at the same time as this probably would make sense; would the trait live in unic::utils as well? I do think that the macro for implementing the trait should be alongside the trait it implements.

Actually, I remembered this morning that I meant CharProperty trait, but not UnicodeProperty. I don't know if we ever want UnicodeProperty, since there's nothing special about Unicode-defined/UCD-defined properties comparing to other ones, like those defined in Unicode IDNA/Emoji specs, or the IETF-defined ones.

So, I think we can put both trait CharProperty and macro char_property in the same crate: unic::utils::char_property.

I'll submit a PR today.

@CAD97 CAD97 changed the title Char property macro Char property macro prototype 1 Jul 19, 2017
@CAD97 CAD97 mentioned this pull request Jul 23, 2017
@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 23, 2017

Supplanted by #48

@CAD97 CAD97 closed this Jul 23, 2017
@CAD97 CAD97 deleted the char-property-macro branch August 1, 2017 03:49
bors bot added a commit that referenced this pull request Aug 10, 2017
48: Char property macro 2.0 r=behnam

Replaces #41. See #41 for earlier discussion.

An example will show better than I can tell:

```rust
char_property! {
    /// Represents the Unicode character
    /// [*Bidi_Class*](http://www.unicode.org/reports/tr44/#Bidi_Class) property,
    /// also known as the *bidirectional character type*.
    ///
    /// * <http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types>
    /// * <http://www.unicode.org/reports/tr44/#Bidi_Class_Values>
    pub enum BidiClass {
        /// Any strong left-to-right character
        ///
        /// ***General Scope***
        ///
        /// LRM, most alphabetic, syllabic, Han ideographs,
        /// non-European or non-Arabic digits, ...
        LeftToRight {
            abbr => L,
            long => Left_To_Right,
            display => "Left-to-Right",
        }

        /// Any strong right-to-left (non-Arabic-type) character
        ///
        /// ***General Scope***
        ///
        /// RLM, Hebrew alphabet, and related punctuation
        RightToLeft {
            abbr => R,
            long => Right_To_Left,
            display => "Right-to-Left",
        }

        /// Any strong right-to-left (Arabic-type) character
        ///
        /// ***General Scope***
        ///
        /// ALM, Arabic, Thaana, and Syriac alphabets,
        /// most punctuation specific to those scripts, ...
        ArabicLetter {
            abbr => AL,
            long => Arabic_Letter,
            display => "Right-to-Left Arabic",
        }
    }
}

/// Abbreviated name bindings for the `BidiClass` property
pub mod abbr_names for abbr;
/// Name bindings for the `BidiClass` property as they appear in Unicode documentation
pub mod long_names for long;
```

expands to:

```rust
/// Represents the Unicode character
/// [*Bidi_Class*](http://www.unicode.org/reports/tr44/#Bidi_Class) property,
/// also known as the *bidirectional character type*.
///
/// * <http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types>
/// * <http://www.unicode.org/reports/tr44/#Bidi_Class_Values>
#[allow(bad_style)]
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub enum BidiClass {
    /// Any strong left-to-right character
    LeftToRight,
    /// Any strong right-to-left (non-Arabic-type) character
    RightToLeft,
    /// Any strong right-to-left (Arabic-type) character
    ArabicLetter,
}
/// Abbreviated name bindings for the `BidiClass` property
#[allow(bad_style)]
pub mod abbr_names {
    pub use super::BidiClass::LeftToRight as L;
    pub use super::BidiClass::RightToLeft as R;
    pub use super::BidiClass::ArabicLetter as AL;
}
/// Name bindings for the `BidiClass` property as they appear in Unicode documentation
#[allow(bad_style)]
pub mod long_names {
    pub use super::BidiClass::LeftToRight as Left_To_Right;
    pub use super::BidiClass::RightToLeft as Right_To_Left;
    pub use super::BidiClass::ArabicLetter as Arabic_Letter;
}
#[allow(bad_style)]
#[allow(unreachable_patterns)]
impl ::std::str::FromStr for BidiClass {
    type Err = ();
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "LeftToRight" => Ok(BidiClass::LeftToRight),
            "RightToLeft" => Ok(BidiClass::RightToLeft),
            "ArabicLetter" => Ok(BidiClass::ArabicLetter),
            "L" => Ok(BidiClass::LeftToRight),
            "R" => Ok(BidiClass::RightToLeft),
            "AL" => Ok(BidiClass::ArabicLetter),
            "Left_To_Right" => Ok(BidiClass::LeftToRight),
            "Right_To_Left" => Ok(BidiClass::RightToLeft),
            "Arabic_Letter" => Ok(BidiClass::ArabicLetter),
            _ => Err(()),
        }
    }
}
#[allow(bad_style)]
#[allow(unreachable_patterns)]
impl ::std::fmt::Display for BidiClass {
    fn fmt(&self, f: &mut ::std::fmt::Formatter) -> ::std::fmt::Result {
        match *self {
            BidiClass::LeftToRight => write!(f, "{}", "Left-to-Right"),
            BidiClass::RightToLeft => write!(f, "{}", "Right-to-Left"),
            BidiClass::ArabicLetter => write!(f, "{}", "Right-to-Left Arabic"),
            BidiClass::LeftToRight => write!(f, "{}", "Left_To_Right".replace('_', " ")),
            BidiClass::RightToLeft => write!(f, "{}", "Right_To_Left".replace('_', " ")),
            BidiClass::ArabicLetter => write!(f, "{}", "Arabic_Letter".replace('_', " ")),
            _ => {
                write!(
                    f,
                    "{}",
                    match *self {
                        BidiClass::LeftToRight => "L",
                        BidiClass::RightToLeft => "R",
                        BidiClass::ArabicLetter => "AL",
                        BidiClass::LeftToRight => "LeftToRight",
                        BidiClass::RightToLeft => "RightToLeft",
                        BidiClass::ArabicLetter => "ArabicLetter",
                    }
                )
            }
        }
    }
}
#[allow(bad_style)]
impl ::char_property::EnumeratedCharProperty for BidiClass {
    fn abbr_name(&self) -> &'static str {
        match *self {
            BidiClass::LeftToRight => "L",
            BidiClass::RightToLeft => "R",
            BidiClass::ArabicLetter => "AL",
        }
    }
    fn all_values() -> &'static [BidiClass] {
        const VALUES: &[BidiClass] = &[
            BidiClass::LeftToRight,
            BidiClass::RightToLeft,
            BidiClass::ArabicLetter,
        ];
        VALUES
    }
}
```

All three of the `abbr`, `long`, and `display` properties of the enum are optional, and have sane fallbacks: `abbr_name` and `long_name` return `None` if unspecified, and `fmt::Display` will check, in order, for `display`, `long_name`, `abbr_name`, and the variant name until it finds one to use (stringified, of course).

`FromStr` is defined, matching against any of the provided `abbr`, `long`, and variant name.

<hr />

Important notes:

- <strike>The current format uses associated consts, so it works on beta but won't work on stable until 1.20 is stable.</strike>
  - Consts have a slightly different meaning than `pub use` -- `pub use` aliases the type where `const` is a new object and if used in pattern matching is a `==` call and not a pattern match.
  - For this reason I'm actually slightly leaning towards using `pub use` even once associated consts land; they're compartmentalized (so `use Property::*` doesn't pull in 3x as many symbols as there are variants). After using the const based aliasing for a little bit, I'm inclined to like the current solution of `unic::ucd::bidi::BidiClass::*` + `unic::ucd::bidi::bidi_class::abbr_names::*`. These really should be a `pub use` and not a `const`.
  - Note that I still think `const` are the way to go for cases like `Canonical_Combining_Class`, though.
- <strike>The current syntax could easily be adapted to use modules instead of associated consts, but was written with the associated consts so we could get a feel of how it would look with them.</strike>
- The zero-or-more meta match before a enum variant conflicts with the ident match before 1.20. See rust-lang/rust#42913, rust-lang/rust#24189
- There only tests of the macro are rather thin and could be expanded.
- It's a macro, so the response when you stick stuff not matching the expected pattern is cryptic at best.
- The `CharProperty` trait is pretty much the lowest common denominator. It's a starting point, and we can iterate from there.
- How and where do we want to make `CharProperty` a externally visible trait? Currently having it in namespace is the only way to access `abbr_name` and `long_name`.
- <strike>Earlier discussion suggested putting these into `unic::utils::char_property`. Moving it would be simple, but for now it's living in the root of `unic-utils`</strike>
- <strike>The crate `unic-utils` is currently in the workspace by virtue of being a dependency of `unic`, but is not in any way visible a crate depending on `unic`.</strike>
- <strike>Documentation doesn't exist.</strike>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants