Add an expressions string lookup operator that returns the script of the string? #5807

nickidlugash · 2017-12-05T01:37:15Z

It could be useful to have an expressions string lookup operator that returns the script name of a string. We could use it to style different scripts differently, or only display text for certain scripts.

We currently pull in unicode block data to aid in some script detection checks for text layout – perhaps we can add additional unicode data here for assigning the name of the written script? I think we would need something along the lines of this: http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

We may need logic for deciding what the overall script of a string is (for mixed-script strings), but it might be good to develop similar logic anyway for text layout considerations like horizontal text runs within vertical text?

In addition, we could return a flag for whether we can display a script accurately (based on complex text shaping needs, and possibly other factors). This could help us and our customers create better internationalized maps by making it simpler to use our local language name field (or other localized data sources) while allowing an alternative display for poorly rendered scripts (e.g. display English labels instead).

I discussed this briefly with @ChrisLoer, and initial thoughts were that this seems like a reasonable feature to discuss adding, both in terms of implementation and usefulness. Looking forward to hearing other thoughts on this!

/cc @anandthakker @jfirebaugh @kkaefer @jcsg @1ec5 @ajashton

1ec5 · 2017-12-06T23:21:29Z

> We currently pull in unicode block data to aid in some script detection checks for text layout – perhaps we can add additional unicode data here for assigning the name of the written script? I think we would need something along the lines of this: http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

is_char_in_unicode_block.js comes from the Unicode Character Database’s Blocks.txt. It would be straightforward to add a function that returns the matching key instead of a Boolean, but we’d need to first uncomment all the blocks that are currently commented out. (We commented them out because we didn’t need them for the purposes of vertical text and ideographic line breaking detection.)
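As a rough illustration of "return the matching key instead of a Boolean", here is a minimal sketch. The table is an abbreviated, hand-picked subset of Unicode block ranges, and the names `unicodeBlocks` and `unicodeBlockOfChar` are hypothetical; nothing here is the actual layout of is_char_in_unicode_block.js.

```javascript
// Hypothetical sketch: a small, sorted subset of Unicode block ranges.
// Each entry is [first code point, last code point, block name].
const unicodeBlocks = [
    [0x0000, 0x007F, 'Basic Latin'],
    [0x0370, 0x03FF, 'Greek and Coptic'],
    [0x0400, 0x04FF, 'Cyrillic'],
    [0x0590, 0x05FF, 'Hebrew'],
    [0x0600, 0x06FF, 'Arabic'],
    [0x0900, 0x097F, 'Devanagari'],
    [0x3040, 0x309F, 'Hiragana'],
    [0x4E00, 0x9FFF, 'CJK Unified Ideographs']
];

// Returns the name of the block containing the code point, or undefined
// if the code point falls outside this abbreviated table.
function unicodeBlockOfChar(char) {
    // Binary search works because the table is sorted and non-overlapping.
    let lo = 0;
    let hi = unicodeBlocks.length - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        const [first, last, name] = unicodeBlocks[mid];
        if (char < first) {
            hi = mid - 1;
        } else if (char > last) {
            lo = mid + 1;
        } else {
            return name;
        }
    }
    return undefined;
}
```

A full version would need every block listed, including the ones that are currently commented out.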

It’s worth noting that GL JS is currently unable to display anything in the supplementary planes: #4001. We could still add those blocks to the list for the purpose of this expression operator, though.

jfirebaugh · 2017-12-14T01:49:06Z

Scripts and Blocks are two different things, and I expect that there's not a 1-1 correspondence.

I'm not super familiar with the state of the art for script detection, but I expect that, like most things related to human language, it's pretty tricky and full of subtle nuances. My guess is that a categorization of "primary script" as described by @nickidlugash is something that's better done as a processing step when the datasource is built, and included as a feature property.

I guess what I'm getting at is that I think we should step back and capture what the underlying requirements here are, as I'm not sure the feature as described is a good fit.

1ec5 · 2017-12-14T05:57:53Z

Script detection thankfully isn’t quite as difficult a problem as language detection, but you’re right that script detection requires more nuance than what is_char_in_unicode_block.js provides. For the most part, a script corresponds to one or more Unicode blocks, but a block doesn’t necessarily correspond to a single script. If you follow ISO 15924’s definition of a script:

Greek (Grek) and Coptic (Copt) share the Greek and Coptic block.
Some blocks, such as Combining Diacritical Marks, don’t correspond to a specific script.
Japanese (Japn) is an umbrella script that includes Hiragana (Hira), Katakana (Kana), and Kanji (Hani) scripts.
Simplified Chinese (Hans) and Traditional Chinese (Hant) codepoints are intermingled within the various CJK blocks.

We probably could expose the data from Scripts.txt without much effort or size increase. However, I agree that we may end up needing something a bit different depending on the intended use cases.
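To make the block-versus-script distinction concrete, here is a hedged sketch that leans on the JavaScript regex engine's built-in script data (ES2018 Unicode property escapes) rather than on anything shipped in GL JS. The helper names `scriptOfChar` and `dominantScript` are made up for illustration, and the "most characters wins" rule is only one possible answer to the mixed-script question raised at the top of the thread.

```javascript
// Sketch only: relies on the regex engine's Unicode Script property,
// not on data bundled with GL JS. The list of scripts is abbreviated.
const scriptMatchers = {
    Latin: /\p{Script=Latin}/u,
    Greek: /\p{Script=Greek}/u,
    Coptic: /\p{Script=Coptic}/u,
    Arabic: /\p{Script=Arabic}/u,
    Han: /\p{Script=Han}/u,
    Hiragana: /\p{Script=Hiragana}/u,
    Katakana: /\p{Script=Katakana}/u
};

// Hypothetical helper: script name of a single character, or undefined
// for Common/Inherited characters and scripts missing from the table.
function scriptOfChar(ch) {
    for (const name of Object.keys(scriptMatchers)) {
        if (scriptMatchers[name].test(ch)) return name;
    }
    return undefined;
}

// Hypothetical helper: the script that covers the most characters.
// Ties go to whichever script was encountered first.
function dominantScript(str) {
    const counts = {};
    for (const ch of str) {
        const script = scriptOfChar(ch);
        if (script) counts[script] = (counts[script] || 0) + 1;
    }
    let best;
    for (const name of Object.keys(counts)) {
        if (best === undefined || counts[name] > counts[best]) best = name;
    }
    return best;
}
```

With this sketch, U+03B1 (α) and U+03E2 (Ϣ) sit in the same Greek and Coptic block but resolve to `'Greek'` and `'Coptic'` respectively, while a combining mark such as U+0301 resolves to no script at all.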

1ec5 · 2018-02-15T05:41:09Z

> allow an alternative display for poorly rendered scripts (e.g. display English labels instead)

For this use case specifically, it would be straightforward to detect codepoints that require unsupported typographic features. Instead of exposing an open-ended script lookup operator, how about a simpler operator that returns whether GL thinks it can render a given string? Then a style could combine that with the case operator to fall back to a different, more easily rendered property, such as name_en in the Mapbox Streets source.

ChrisLoer · 2018-02-28T19:43:16Z

Here's a possible implementation of "can we render this character" which captures about what our expectations are but hopefully also shows the limitations of this approach:

exports.charInRenderableScript = function(char: number, canRenderRTL: boolean) {
    // This is a rough heuristic: whether we "can render" a script
    // actually depends on the properties of the font being used
    // and whether differences from the ideal rendering are considered
    // semantically significant.

    // Even in Latin script, we "can't render" combinations such as the fi
    // ligature, but we don't consider that semantically significant.
    if (!canRenderRTL &&
        ((char >= 0x0590 && char <= 0x08FF) ||
         isChar['Arabic Presentation Forms-A'](char) ||
         isChar['Arabic Presentation Forms-B'](char))) {
        // Block out bulk of Hebrew, Arabic and other RTL scripts
        return false;
    }
    if (char >= 0x0900 && char <= 0x109F) {
        // Block out Indic and Southeast Asian scripts from Devanagari
        // to Myanmar. Note that some scripts such as Thai and Lao mainly
        // rely on relatively simple diacritic placement, and depending
        // on the font, rendering may be legible if not fully correct.
        return false;
    }
    return true;
};
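A per-character check still has to be lifted to whole strings before an expression operator can use it. The sketch below is one way that might look. It restates the heuristic in self-contained form, assuming the second range starts at Devanagari (U+0900) as its comment describes, and writes the two Arabic Presentation Forms blocks as literal ranges (U+FB50 to U+FDFF and U+FE70 to U+FEFF) in place of the `isChar` lookups. The wrapper name `stringInRenderableScript` is hypothetical.

```javascript
// Self-contained restatement of the per-character heuristic.
function charInRenderableScript(char, canRenderRTL) {
    if (!canRenderRTL &&
        ((char >= 0x0590 && char <= 0x08FF) ||
         (char >= 0xFB50 && char <= 0xFDFF) ||   // Arabic Presentation Forms-A
         (char >= 0xFE70 && char <= 0xFEFF))) {  // Arabic Presentation Forms-B
        return false;
    }
    if (char >= 0x0900 && char <= 0x109F) {
        // Devanagari through Myanmar (assumed lower bound, see lead-in)
        return false;
    }
    return true;
}

// Hypothetical wrapper: a string counts as renderable only if every
// code point in it passes the per-character check.
function stringInRenderableScript(str, canRenderRTL) {
    // for...of iterates by code point, so surrogate pairs stay intact.
    for (const ch of str) {
        if (!charInRenderableScript(ch.codePointAt(0), canRenderRTL)) {
            return false;
        }
    }
    return true;
}
```

Rejecting the whole string on the first failing character is a deliberately conservative choice; a label that mixes Latin text with a single Devanagari character would fall back entirely.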

ChrisLoer · 2018-04-18T00:47:17Z

Closing with #6260: we went with the more limited is-supported-script instead of trying for general purpose (and tricky to define correctly) "what script is this string" functionality.
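For reference, the fallback discussed earlier in the thread (use the local name when it can be rendered, otherwise `name_en`) could be written roughly as follows with `is-supported-script` and `case`. The layer id, source name, and source-layer are placeholders chosen for illustration, not real resources.

```javascript
// Sketch of a symbol layer that falls back to English labels.
// 'place-labels', 'streets', and 'place_label' are placeholder names.
const labelLayer = {
    id: 'place-labels',
    type: 'symbol',
    source: 'streets',
    'source-layer': 'place_label',
    layout: {
        'text-field': [
            'case',
            ['is-supported-script', ['get', 'name']],
            ['get', 'name'],     // local name when GL thinks it can render it
            ['get', 'name_en']   // otherwise fall back to the English name
        ]
    }
};
```

An object like this would be passed to `map.addLayer` once the corresponding source has been added to the map.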

jfirebaugh added the feature 🍏 label Dec 11, 2017

ChrisLoer mentioned this issue Mar 1, 2018: is-supported-script expression #6260 (Merged, 3 tasks)

anandthakker mentioned this issue Apr 9, 2018: Master ticket tracking expression API completeness #6484 (Open, 16 tasks)

ChrisLoer mentioned this issue Apr 16, 2018: Port is-supported-script expression to native mapbox/mapbox-gl-native#11693 (Closed)

ChrisLoer closed this as completed Apr 18, 2018
