Add an expressions string lookup operator that returns the script of the string? #5807

Closed
nickidlugash opened this issue Dec 5, 2017 · 6 comments

@nickidlugash

It could be useful to have an expressions string lookup operator that returns the script name of a string. We could use it to style different scripts differently, or only display text for certain scripts.

We currently pull in unicode block data to aid in some script detection checks for text layout – perhaps we can add additional unicode data here for assigning the name of the written script? I think we would need something along the lines of this: http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

We may need logic for deciding what the overall script of a string is (for mixed-script strings), but it might be good to develop similar logic anyway for text layout considerations like horizontal text runs within vertical text?
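One way to picture the "overall script" decision for mixed-script strings is a simple majority vote over characters. The sketch below is only an illustration, not a proposal for how GL JS would do it: it assumes a JavaScript runtime that supports Unicode property escapes (`\p{Script=...}`), it covers only a handful of scripts, and the name `dominantScript` is made up for this example.

```javascript
// Illustrative subset of scripts; a real table would be much longer.
const scriptPatterns = {
    Latin: /\p{Script=Latin}/u,
    Greek: /\p{Script=Greek}/u,
    Cyrillic: /\p{Script=Cyrillic}/u,
    Arabic: /\p{Script=Arabic}/u,
    Hebrew: /\p{Script=Hebrew}/u,
    Han: /\p{Script=Han}/u,
    Hiragana: /\p{Script=Hiragana}/u,
    Katakana: /\p{Script=Katakana}/u
};

// Hypothetical helper: returns the name of the script that accounts for
// the most characters in `text`, or null if no listed script occurs.
// Characters outside the table (digits, spaces, punctuation) are ignored.
function dominantScript(text) {
    const counts = {};
    for (const ch of text) {
        for (const name of Object.keys(scriptPatterns)) {
            if (scriptPatterns[name].test(ch)) {
                counts[name] = (counts[name] || 0) + 1;
                break;
            }
        }
    }
    let best = null;
    for (const name of Object.keys(counts)) {
        // Strict comparison: on a tie, the script seen first wins.
        if (best === null || counts[name] > counts[best]) {
            best = name;
        }
    }
    return best;
}
```

With this rule, `dominantScript('Tokyo 東京')` comes out as Latin, because five of the seven script-bearing characters are Latin. That shows why a plain count may not match what a cartographer would call the primary script.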

In addition, we could return a flag for whether we can display a script accurately (based on complex text shaping needs, and possibly other factors). This could help us and our customers create better internationalized maps by making it simpler to use our local language name field (or other localized data sources) while allowing an alternative display for poorly rendered scripts (e.g. displaying English labels instead).

I discussed this briefly with @ChrisLoer, and initial thoughts were that this seems like a reasonable feature to discuss adding, both in terms of implementation and usefulness. Looking forward to hearing other thoughts on this!

/cc @anandthakker @jfirebaugh @kkaefer @jcsg @1ec5 @ajashton

@1ec5
Contributor

1ec5 commented Dec 6, 2017

> We currently pull in unicode block data to aid in some script detection checks for text layout – perhaps we can add additional unicode data here for assigning the name of the written script? I think we would need something along the lines of this: http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

is_char_in_unicode_block.js comes from the Unicode Character Database’s Blocks.txt, not Scripts.txt. It would be straightforward to add a function that returns the matching key instead of a Boolean, but we’d need to first uncomment all the Unicode blocks that are currently commented out. (We commented them out because we didn’t need them for the purposes of vertical text and ideographic line breaking detection.)
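As a rough sketch of "return the matching key instead of a Boolean": the table below is a small hand-picked subset of Unicode block ranges, and both `unicodeBlocks` and `blockOfChar` are hypothetical names, not the actual layout of is_char_in_unicode_block.js.

```javascript
// Illustrative subset of Unicode blocks as [name, first, last] code points.
const unicodeBlocks = [
    ['Basic Latin', 0x0000, 0x007F],
    ['Greek and Coptic', 0x0370, 0x03FF],
    ['Hebrew', 0x0590, 0x05FF],
    ['Arabic', 0x0600, 0x06FF],
    ['Devanagari', 0x0900, 0x097F],
    ['Hiragana', 0x3040, 0x309F],
    ['Katakana', 0x30A0, 0x30FF],
    ['CJK Unified Ideographs', 0x4E00, 0x9FFF]
];

// Hypothetical lookup: returns the name of the block containing the code
// point `char`, or null if the code point is not in the table above.
function blockOfChar(char) {
    for (const [name, first, last] of unicodeBlocks) {
        if (char >= first && char <= last) {
            return name;
        }
    }
    return null;
}
```

A linear scan is fine for a table this small. A full block list is sorted by code point, so a binary search would be the natural choice there.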

It’s worth noting that GL JS is currently unable to display anything in the supplementary planes: #4001. We could still add those blocks to the list for the purpose of this expression operator, though.

@jfirebaugh
Contributor

Scripts and Blocks are two different things, and I expect that there's not a 1-1 correspondence.

I'm not super familiar with the state of the art for script detection, but I expect that, like most things related to human language, it's pretty tricky and full of subtle nuances. My guess is that a categorization of "primary script" as described by @nickidlugash is something that's better done as a processing step when the datasource is built, and included as a feature property.

I guess what I'm getting at is that I think we should step back and capture what the underlying requirements here are, as I'm not sure the feature as described is a good fit.

@1ec5
Contributor

1ec5 commented Dec 14, 2017

Script detection thankfully isn’t quite as difficult a problem as language detection, but you’re right that script detection requires more nuance than what is_char_in_unicode_block.js provides. For the most part, a script corresponds to one or more Unicode blocks, but a block doesn’t necessarily correspond to a single script. If you follow ISO 15924’s definition of a script:

  • Greek (Grek) and Coptic (Copt) share the Greek and Coptic block.
  • Some blocks, such as Combining Diacritical Marks, don’t correspond to a specific script.
  • Japanese (Japn) is an umbrella script that includes Hiragana (Hira), Katakana (Kana), and Kanji (Hani) scripts.
  • Simplified Chinese (Hans) and Traditional Chinese (Hant) codepoints are intermingled within the various CJK blocks.

We probably could expose the data from Scripts.txt without much effort or size increase. However, I agree that we may end up needing something a bit different depending on the intended use cases.
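To make the block-versus-script mismatch concrete, here is a small sketch that compares a block range check with the Unicode Script property. It assumes a runtime with Unicode property escapes, and `inGreekAndCopticBlock`, `scriptTests` and `scriptOfChar` are names invented for this example. The ISO 15924 codes are the ones listed above, plus Zinh for inherited combining marks.

```javascript
// Range of the "Greek and Coptic" block (U+0370 to U+03FF).
function inGreekAndCopticBlock(ch) {
    const cp = ch.codePointAt(0);
    return cp >= 0x0370 && cp <= 0x03FF;
}

// ISO 15924 code paired with a test on the Unicode Script property.
const scriptTests = [
    ['Grek', /\p{Script=Greek}/u],
    ['Copt', /\p{Script=Coptic}/u],
    ['Hira', /\p{Script=Hiragana}/u],
    ['Kana', /\p{Script=Katakana}/u],
    ['Hani', /\p{Script=Han}/u],
    ['Zinh', /\p{Script=Inherited}/u]
];

// Hypothetical lookup: returns the ISO 15924 code for a single character,
// or null if its script is not in the short table above.
function scriptOfChar(ch) {
    for (const [code, pattern] of scriptTests) {
        if (pattern.test(ch)) {
            return code;
        }
    }
    return null;
}

// U+03B1 (Greek alpha) and U+03E2 (Coptic shei) sit in the same block
// but carry different Script values.
const alpha = '\u03B1';
const shei = '\u03E2';
console.log(inGreekAndCopticBlock(alpha), scriptOfChar(alpha));
console.log(inGreekAndCopticBlock(shei), scriptOfChar(shei));
```

Note that there is no Script value for Japn, Hans or Hant: those distinctions cannot be read off individual code points, which is the limitation the last two bullets describe.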

@1ec5
Contributor

1ec5 commented Feb 15, 2018

> allow an alternative display for poorly rendered scripts (e.g. display English labels instead)

For this use case specifically, it would be straightforward to detect codepoints that require unsupported typographic features. Instead of exposing an open-ended script lookup operator, how about a simpler operator that returns whether GL thinks it can render a given string? Then a style could combine that with the case operator to fall back to a different, more easily rendered property, such as name_en in the Mapbox Streets source.

@ChrisLoer
Contributor

Here's a possible implementation of "can we render this character" that roughly captures our expectations but hopefully also shows the limitations of this approach:

exports.charInRenderableScript = function(char: number, canRenderRTL: boolean) {
    // This is a rough heuristic: whether we "can render" a script
    // actually depends on the properties of the font being used
    // and whether differences from the ideal rendering are considered
    // semantically significant.

    // Even in Latin script, we "can't render" combinations such as the fi
    // ligature, but we don't consider that semantically significant.
    if (!canRenderRTL &&
        ((char >= 0x0590 && char <= 0x08FF) ||
         isChar['Arabic Presentation Forms-A'](char) ||
         isChar['Arabic Presentation Forms-B'](char))) {
        // Block out bulk of Hebrew, Arabic and other RTL scripts
        return false;
    }
    if (char >= 0x0900 && char <= 0x109F) {
        // Block out Indic and Southeast Asian scripts from Devanagari
        // to Myanmar. Note that some scripts such as Thai and Lao mainly
        // rely on relatively simple diacritic placement, and depending
        // on the font, rendering may be legible if not fully correct.
        return false;
    }
    return true;
};
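A style-level operator would need a string-level answer rather than a per-character one. The following is a minimal, self-contained sketch of how that could look. It restates the heuristic with the two Arabic Presentation Forms block ranges inlined (so it runs without is_char_in_unicode_block.js) and with U+0900 to U+109F as the Devanagari-to-Myanmar range. The wrapper name `stringInRenderableScript` is hypothetical.

```javascript
// Stand-in for the per-character heuristic, with block ranges inlined.
function charInRenderableScript(char, canRenderRTL) {
    if (!canRenderRTL &&
        ((char >= 0x0590 && char <= 0x08FF) ||
         (char >= 0xFB50 && char <= 0xFDFF) ||   // Arabic Presentation Forms-A
         (char >= 0xFE70 && char <= 0xFEFF))) {  // Arabic Presentation Forms-B
        return false;
    }
    if (char >= 0x0900 && char <= 0x109F) {
        // Devanagari through Myanmar
        return false;
    }
    return true;
}

// Hypothetical wrapper: a string counts as renderable only if every
// code point in it passes the per-character check.
function stringInRenderableScript(text, canRenderRTL) {
    for (const ch of text) {
        if (!charInRenderableScript(ch.codePointAt(0), canRenderRTL)) {
            return false;
        }
    }
    return true;
}
```

Under this rule a single unsupported character rejects the whole label, which is conservative but matches the "fall back to another property" use case: an Arabic name passes only when `canRenderRTL` is true, and a Devanagari name fails either way.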

@ChrisLoer
Contributor

Closing with #6260: we went with the more limited is-supported-script instead of trying for general purpose (and tricky to define correctly) "what script is this string" functionality.
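For the fallback use case discussed earlier in this thread, a style could combine is-supported-script with the case operator roughly as follows. This is a sketch only: the helper `textFieldWithFallback`, the layer id and the source id are made up, and the `name` / `name_en` properties and `place_label` source layer assume a Mapbox Streets-like source.

```javascript
// Hypothetical helper: builds a text-field expression that shows the
// `primary` property when its script is reported as supported, and the
// `fallback` property otherwise.
function textFieldWithFallback(primary, fallback) {
    return [
        'case',
        ['is-supported-script', ['get', primary]],
        ['get', primary],
        ['get', fallback]
    ];
}

// Example symbol layer; the id and source names are placeholders.
const labelLayer = {
    id: 'place-labels',
    type: 'symbol',
    source: 'streets',
    'source-layer': 'place_label',
    layout: {
        'text-field': textFieldWithFallback('name', 'name_en')
    }
};
```

Keeping the decision inside the style expression means the data source does not need a precomputed script property for this particular use case.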
