Update design doc for message pattern quoting #503

echeran · 2023-10-27T01:59:30Z

An update to the design doc, specifically on the topic of: "Do we allow unquoted variant patterns?"

Very much a WIP while in draft mode (not suitable for "drive-by reviews" until officially ready for review).

@mihnita please take an initial look and provide suggestions.

…of whitespace handling rules for template languages

…ut frequency

…y vs. importance vs. i18n best practices

aphillips

Looking good. Thank you both for working on this (especially during "maximum crunch time"). I've made a number of suggestions below. Please have a look. They are mostly editorial in nature.

exploration/text-vs-code.md

aphillips · 2023-10-27T15:16:50Z

exploration/text-vs-code.md

+Rarely do messages that need to include leading or trailing whitespace do so due to
+how they will be concatenated with other text,


I think this change made the text far less clear.

To be honest, I would have guessed that you would have replaced the somewhat biased word "Rarely" with something more neutral such as "Some messages need..."

After the resource file gets parsed as XML, the Android string resource format requiring

After the resource file gets parsed as XML, the Android resource compiler requires

somewhat biased word "Rarely" with something more neutral such as "Some messages need..."

In all fairness, 0.3% is indeed rare.
True, that number comes from an HTML oriented corpus, but I don't have access to much code using Windows / MacOS native formats.

aphillips · 2023-10-27T15:22:21Z

exploration/text-vs-code.md

+Also importantly, we cannot make assumptions about the validity of leading or trailing whitespace in a message,
+especially since their usage may be entirely unrelated to internationalization issues (ex: sentence agreement disruption by concatenation).


This doesn't seem clear to me?

I think what you might be trying to say is:

Suggested change

Also importantly, we cannot make assumptions about the validity of leading or trailing whitespace in a message,

especially since their usage may be entirely unrelated to internationalization issues (ex: sentence agreement disruption by concatenation).

Also, importantly, whether the intentional inclusion of whitespace by a

message author might be considered "desirable" or might be interpreted

as "an internationalization bug",

we need to provide the ability of an author to control the content of a given pattern without ambiguity.

I would change this whole point a bit. Something along these lines:

All common OSes (Windows, MacOS, Linux, iOS, Android) have "plain text" widgets and "rich formatting widgets" (usually Web).
The "Web widgets" usually drag with them a whole HTML engine. That is slow, and memory consuming.
So the most commonly used widgets are plain text.
And when that is all you have, spaces and newlines are used to create "fake" formatting.
Things like paragraphs, indents, lists (bulleted or numeric).

Some examples (pick and choose):

https://petri-media.s3.amazonaws.com/2021/05/Figure10-3.png

https://docs.oracle.com/cd/E19957-01/817-4220/images/SetupWizWelcome2.gif

https://www.manageengine.com/products/support-center/help/installationguide/images/installwizard.jpg

Even in HTML there are sometimes reasons to force whitespace to be preserved.

TLDR: trailing spaces are not necessarily an i18n bug, so it is not the job of MF2 to discourage them, or to get in the way.

exploration/text-vs-code.md

aphillips · 2023-10-27T15:49:58Z

exploration/text-vs-code.md

-Messages themselves are "simple strings" and must be considered to be a single
-line of text. In many containing formats, newlines will be represented as the local
-equivalent of `\n`.
+Messages themselves are "simple strings" and must be considered to be WYSIWYG.


This is incorrect and the change greatly reduces the impact of the document IMO. The thing that is WYSIWYG is the pattern. In the case of simple messages, this is the whole message. But in the case of complex messages, what you see ({#input $foo :number minimumFractionDigits=11}) is not exactly what you get 😉

The most important part of the original statement here is removed, which reminds readers that this message:

myMessage = {{
 {#input $var :number}
 {{You have {$var} message(s)}}
}}

Is actually this message in many storage formats:

myMessage = {{\n {#input $var :number}\n {{You have {$var} message(s)}}\n}}

True. But the storage format might remove spaces / newlines, and not only from the beginning.

For example, I can do this in a properties file:

myMessage = {{\
    input {$var :number}\
    {{You have {$var} message(s)}}\
    }}

What MF2 sees (once loaded from the file) is a single line, with leading spaces trimmed (from each line) and newlines removed:

{{input {$var :number}{{You have {$var} message(s)}}}}
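A minimal Python sketch of the line-continuation behaviour described above may help. It assumes the Java `.properties` convention that a line ending in an odd number of backslashes continues on the next line, and that leading whitespace on the continuation line is dropped. The helper name `join_continuations` is hypothetical, and key/value splitting is deliberately left out.

```python
def join_continuations(text: str) -> list[str]:
    """Join backslash-continued lines into logical lines.

    Assumption: a line ending in an odd number of backslashes continues on
    the next line, whose leading whitespace is discarded.
    """
    logical_lines = []
    buffer = ""
    continuing = False
    for raw in text.splitlines():
        # Leading whitespace is only dropped on continuation lines.
        line = raw.lstrip(" \t\f") if continuing else raw
        trailing_backslashes = len(line) - len(line.rstrip("\\"))
        if trailing_backslashes % 2 == 1:
            buffer += line[:-1]  # drop the continuation backslash
            continuing = True
        else:
            logical_lines.append(buffer + line)
            buffer, continuing = "", False
    if continuing:
        logical_lines.append(buffer)
    return logical_lines


source = "\n".join([
    "myMessage = {{\\",
    "    input {$var :number}\\",
    "    {{You have {$var} message(s)}}\\",
    "    }}",
])

print(join_continuations(source))
```

The single logical line that comes out no longer contains the newlines or the indentation, which is the point being made: the message the formatter receives can differ from what the author typed in the resource file.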

It sounds like there are 2 points here, and that we can and want them both:

1. patterns are WYSIWYG and have no restrictions on newline or most other characters within

2. messages are treated as just a string ("simple strings") in the containing format

Our text for point 1 needs to be corrected to say "pattern" in cases where it incorrectly said "message".

We got to writing what we wrote because of the incorrect detail in the original text, which said that messages "must be considered to be a single line of text". That phrase should not be preserved.

aphillips · 2023-10-27T15:51:46Z

exploration/text-vs-code.md

+Messages themselves are "simple strings" and must be considered to be WYSIWYG.
+The WYSIWYG nature of representing a message pattern is independent of whether the message is a single line or contains multiple lines.
+
+There is no restriction that a message must only contain a single line (that is, not contain any newline characters),


I would s/message/pattern/g this section (carefully because my suggestion is not always true). Only talk about the message when you intend to include the code.

aphillips · 2023-10-27T15:55:14Z

exploration/text-vs-code.md

+when there is 1+ declarations in a `match` (selection) message,
+or when there are 2+ declarations in a non-`match` complex message.
+
+Cons:


Consider adding that the message closing pattern characters add no value.

About closing brackets, pros and cons:

Closing brackets are ingrained in developers.
{{ something something feels broken because of the missing }}

The closing brackets might assist in some storage formats (maybe to be designed), especially some that might be minimized by tools.
Example:

msg1 = {{ some complex message }}
when = Type your name here.

Minimized: msg1={{ some complex message }}when=Type your name here.

One can use the space after closing to add comments, metadata for linters or other tools.

{{ ... when ... }} lint_rules: { maxlen:"80 chars" } ref : { screenshot: "https://example.com/foo.jpg", glossaryId: 1234 }

Not strong arguments. But there is some value.
Even if all it does is prevent the "what the heck is this" reaction.

Addition to the pros (new bullet or changing an existing one): visibility (?)

"No trimming / Always delimit" also makes it clear what spaces are rendered.
Example:

{#when one} This is a message (when one) condition

vs

{{ when one { This is a message (when one) condition } }}

In the second case it is clear what is rendered: leading spaces, no matter what kind.

In the first case it is not clear.
"We trim the ASCII spaces" rule does not help, since the spaces might be non-breaking spaces, or em-space, en-space, ideographic space, and all the other characters that look like space on screen, but are not ASCII space.

So visually I don't know where the message starts / stops.
Even in edit mode (when I translate) I don't know.
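To illustrate the point about space look-alikes, here is a small Python sketch. The `trim_ascii` function is a hypothetical stand-in for a "trim the ASCII spaces" rule (it strips only U+0020, tab, CR and LF); it is not taken from any MF2 implementation.

```python
import unicodedata

# Characters that all look like a space on screen.
SPACE_LIKE = ["\u0020", "\u00A0", "\u2002", "\u2003", "\u3000"]


def trim_ascii(pattern: str) -> str:
    """Hypothetical trimming rule: strip only ASCII space, tab, CR and LF."""
    return pattern.strip(" \t\r\n")


for ch in SPACE_LIKE:
    trimmed = trim_ascii(ch + "This is a message")
    survived = trimmed.startswith(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):<18} survives trimming: {survived}")
```

Only the plain U+0020 space is removed. The no-break, en, em and ideographic spaces all survive, even though none of them can be told apart from the trimmed one by eye.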

A bit worse: as a translator you often get strings with no context.
There is no way to know what kind of API the string will go through.

So if I get this string "Bill Gates did something" and I have to translate, and I want to put the "honorific space" in front of the name, I don't know how to do it.
Is the message going through MF2 with trimming? Then I have to wrap it in { Bill ....}
But maybe it is not going through MF2.
Then if I wrap it, the { ... } will render on screen "as is", which is not what you want.

We've seen this:
"The bee" => "L'abeille" : if the message goes through MF1, the apostrophe needs to be escaped
"I'm 1 in 100" => "Eu sunt 1%" : if the string goes through a printf-like API then I need to escape the %

And as a translator I have no way to know what API the dev uses, and I'm not familiar with the escape rules for myriads of APIs.
I am in fact faced with more APIs than a developer.
A dev might do "java + html + js".
A translator often works on many projects, from many companies, so they are exposed to strings consumed by native Windows apps, PHP, some Ruby stuff, C#, SQL, others. And they may even switch several times per day.

These days you don't pay the bills as a translator if you only handle one single format for one product from one company.
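The printf-style case mentioned above can be reproduced with a short sketch. Python's `%` string operator is used here only as a stand-in for "a printf-like API"; the `render` helper is hypothetical, and the MF1 apostrophe case is not shown because it would need an ICU binding.

```python
def render(template: str, *args) -> str:
    """Stand-in for a printf-like API: Python's % operator on strings."""
    return template % args


# The translator knew the string goes through a printf-like API and escaped the %.
print(render("Eu sunt 1%%"))  # prints "Eu sunt 1%"

# The translator did not know, and left the % unescaped.
try:
    render("Eu sunt 1%")
except ValueError as err:
    print("formatting failed:", err)
```

The same translated text is either correct or broken depending on which API consumes it, which is exactly the information the translator usually lacks.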

+1 to @mihnita 's point about the possibility "to add comments, metadata for linters or other tools."

We can discuss as a workgroup whether we leave the space after the closing delimiter as a free-for-all, or we reserve it for us as a standard for future extensions.

Note: we are losing the possibility to do the same for simple messages because we are moving away from our current syntax.

spec/syntax.md

stasm

I know you said this wasn't ready for reviews yet, so please consider this my early feedback on an early draft, which I'm sharing now because the day here is soon over :)

stasm · 2023-10-27T16:15:58Z

exploration/text-vs-code.md

+
+—Rico Mariani, MS Research MindSwap Oct 2003. (<a href="https://learn.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success">restated by Brad Adams</a>, MS CLR and .Net team cofounder)
+</blockquote>
+</details>

 Developers and translators should be able to read and write the syntax easily in a text editor.

 Translators (and their tools) are not software engineers, so we want our syntax


I think it would be good to prioritize the requirements. For example, while I agree that we should, in general terms, make the syntax simple and robust, I would also suggest that the primary consumers of the message syntax are developers. Translators will oftentimes work with just the pattern syntax, through CAT tools.

I agree. See the "Evaluation" portion of the "Proposed Design" section below.

However, prioritizing requirements to me is tantamount to defining our value system to evaluate. We don't yet have alignment as a group on what our requirements are, let alone our prioritization of those requirements (value system). This is another reason why I wanted to keep the values (prioritization of requirements) only alongside the area explaining how we chose to propose an option via evaluation.

Before we can do anything further towards implementing your suggestion, you/we have to get the group to be self-aware and precise on their values, and then maybe after that, alignment. :-)

stasm · 2023-10-27T16:21:00Z

exploration/text-vs-code.md

+Within a complex message, patterns are always quoted with `{{...}}` or other choice of delimiter.
+
+The entire complex message is also wrapped with `{{...}}` or other choice of delimiter.
+This allows interior "code mode" of message to have flexible whitespace in between tokens


What do you think should happen to whitespace outside the entire code block? Should we specify it here as well?

We think it should be specified, if not already.
(will probably be covered anyway when we get to update the EBNF)

My choice would be to say:

nothing before {#. If there is something, then we are in simple text mode, and {# will be an error

after closing #} we have several options:

no closing, we drop it as a requirement

closing is optional. If as a developer you are bothered to see unclosed brackets, feel free to close it. Does nothing

closing is mandatory, and the message ends there

nothing allowed after it => unnecessarily rigid?

allowed only spaces / newlines after it

reserve it for us. We can extend the standard later to add comments, lint directives, links to images, etc

allow for developers to do what they want, "free for all". The message ended, we ignore the rest.
They can add comments, lint directives, links to images, whatever

On the WG to discuss and decide.
I like the idea to reserve it for us.
It is a non-breaking change.

I would like to initially say anything outside the complex message-wide delimiters is an error (invalid). In the future, if we want to relax requirements and say annotations & message description notes are allowed in message syntax, you have the freedom to do so (after the closing delimiter).

In general with API design, you can always relax requirements and narrow outputs, but the reverse is not possible (causes breaking changes).

If there is anything "outside" the complex message-wide delimiters, then:

It's whitespace that we trim
a. Or produces an error (worth discussion)

or the message is actually a simple message that produces a lot of errors

This {{ match {$var} when * {{{$var}}}}} is an interesting message.

This evaluates, I think, as:

This {�} is an interesting message.

Or possibly (with $var==123) as:

This {�}{�} match 123 when * 123{�}{�} is an interesting message.

(both emit a syntax error)
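As a rough illustration of the strict option discussed in this thread (anything after the message-wide closing delimiter is an error), here is a Python sketch. It is a simplification under stated assumptions, not the MF2 grammar: `classify` is a hypothetical helper, `{{` is assumed to be the message-wide delimiter, braces are matched by plain depth counting, and escapes are ignored. A message that does not start with `{{` is treated as a simple message, which follows one of the two readings given above.

```python
def classify(message: str) -> str:
    """Return 'simple' or 'complex'; raise ValueError on text after the close.

    Assumptions: '{{' opens a complex message, braces nest by simple
    depth counting, and escape sequences are not handled.
    """
    if not message.startswith("{{"):
        # Anything before the opening delimiter means simple-text mode.
        return "simple"
    depth = 0
    for index, ch in enumerate(message):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                rest = message[index + 1:]
                if rest:
                    raise ValueError(f"text after closing delimiter: {rest!r}")
                return "complex"
    raise ValueError("unbalanced delimiters")


print(classify("{{ match {$var} when * {{{$var}}}}}"))
print(classify("This {{ match {$var} when * {{{$var}}}}} is an interesting message."))
```

Starting with the strict rule keeps the later choices open: trimming whitespace after the close, or reserving that space for annotations, would each only relax the final `if rest:` check.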

stasm · 2023-10-27T16:26:27Z

exploration/text-vs-code.md

+* The rule about the whether leading and trailing whitespace is included is simple and unambiguous.
+* This matches the WYSIWIG behavior that simple messages preserve.
+* The patterns can be detected within the pattern more easily due to the delimiters serving as a visual anchor.
+* Requiring all patterns to be quoted minimizes the number of characters that need to be escaped within a pattern to 3:


I think this also holds for at least some of the other proposals.

In fact, if we always quote variant patterns, we must make the closing delimiter special. Theoretically, in other proposals, we could only special-case the opening delimiter, because both code and placeholders would be wrapped in the same delimiter. (Although that would require agreeing to a different way of preserving whitespace than the currently agreed {{ ... }}.)

In fact, if we always quote variant patterns, we must make the closing delimiter special.

But we have to do it for all the other proposals too.
Because all allow for optional wrapping.

I think this also holds for at least some of the other proposals.

You're right, it's true for 1a and 2a. Not for 3a. Since it's not a universal aspect of the proposals, it's not a wash (redundant), and thus a point worth mentioning.

I'm open to suggestions for rephrasing "Requiring all patterns" to whatever it is that results in the minimum possible number of characters needing escaping.

stasm · 2023-10-27T16:32:06Z

exploration/text-vs-code.md

+while complex messages use the aforementioned delimiter to quote patterns (ex: `{{...}}`).
+* Another potential drawback, specifically in the case of non-`match` complex messages with exactly 1 declaration,
+is that this option adds 2 extra delimiters compared to an alternative syntax that doesn't require quoted patterns
+and is designed to minimize delimiter usage only to code mode introducers.


We should add some of the other previously discussed drawbacks:

If the code block is delimited from both sides, users may be tempted to insert text around it.

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

adding text after closing is not necessarily a drawback (I have a comment somewhere)

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Fair enough.
Maybe make it optional?

I think that {{ is to make this intentionally ugly. I would probably go with {# to enter code mode, and { to wrap the patterns.
And yes, it is double duty. But choosing anything else means that we escape one extra thing.
If we use <<< and I want (for some reason) <<< Hello world as simple text, then we need to escape it.

So we add another escaping rule.
Pros and cons :-)

If the code block is delimited from both sides, users may be tempted to insert text around it.

Not a moving argument to me. As you pointed out in a different comment, the audience of message authoring is developers. If we say it's not possible to add extraneous text, and they do so and the MF2 implementation rejects the message, they'll figure it out quickly.

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous. Also, even though developers learn instincts early on to always balance delimiters, we create linters & other tooling to double-check.

(Some languages that use delimiters in a simple and regular way have the ability to evolve powerful tools that always keep everything balanced while being easy to use -> easy to do the right thing, impossible to do the wrong thing. But I digress...)

If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

Sure. Option 3a solves for that with the tradeoff of taking on other costs as a result. The next question then becomes how do we prioritize our requirements in order to create a value system that we use to evaluate the tradeoffs (choose)?

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.
Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous.

I disagree a bit. Any enclosing syntax is prone to errors, but the argument here is that the more levels of nesting (and the further apart the enclosure bits), the more opportunities for error exist because the user is keeping track of more things.

Note that one of the proposals for "2a" was to have just a starter sigil. This would eliminate the enclosure for the message.

stasm · 2023-10-27T16:34:22Z

exploration/text-vs-code.md

+
+Cons:
+
+* This comes at the cost of an inconsistency in the WYSIWYG patterns are quoted between simple and complex messages.


We should discuss here the risk of "two syntaxes in a trench coat." For someone who's only ever seen simple messages, the only syntax rule they can infer is that {} is used for placeholders. A whole separate complex-message mode cannot be easily "guessed".

(But see also my comment about considering developers the primary audience for the complex message syntax, so perhaps this is acceptable.)

"two syntaxes in a trench coat."

The expression is designed to sound ugly :-)

But there are already templating systems doing this, and it didn't prevent adoption or trigger many complaints.
Heck, when I write HTML with stylesheets and code I have 3 syntaxes in a trench coat :-)

Or if you write C/C++/C# or something else, you have one syntax in the main code, another syntax in strings, and another in printf strings, etc.

int foo = 10%3;  // syntax one, math, result is 1
puts("10%3");    // syntax 2, in string, output, result is "10%3"
printf("10%3");  // syntax 3, in string, error, need to double the `%` (`%%`) to get "10%3" in output

stasm · 2023-10-27T16:35:53Z

exploration/text-vs-code.md

+our value system places to the requirements met by the pro aspects compared to the con aspects. Namely:
+
+* [high] Unsurprising WYSIWYG behavior from patterns
+* [high] Easy recognition of patterns, even for non-developers


To be fair, I've never found ICU MF patterns easy to spot—specifically because they use {}.

Fair point.

But I've never seen any questions asked for single selections.
(on StackOverflow or internal sites, here and in previous companies)

It gets ugly, even for developers, when you have multiple selections (plural-in-plural, select in plural in ???)

stasm · 2023-10-27T16:42:53Z

exploration/text-vs-code.md

+...
+	<?php
+		if (true) {
+			echo '<p>Hello World</p>';


Just a nit, but I think both PHP and Freemarker would actually prefer the second method:

<?php if (true): ?>
  <p>Hello, world!</p>
<?php endif ?>

I didn't even know that is possible.

If I check https://www.php.net/manual/en/control-structures.if.php and https://www.w3schools.com/php/php_if_else.asp the option is not even mentioned.

It is mentioned in the "User Contributed Notes" of the manual.

But I kind of doubt that something that is not even mentioned in the official manual is the preferred way.

If anything this makes the point that having more than one way to do things, some more recommended than others, is not a good thing.

mihnita · 2023-10-27T16:07:36Z

exploration/text-vs-code.md

+Rarely do messages that need to include leading or trailing whitespace do so due to
+how they will be concatenated with other text,


After the resource file gets parsed as XML, the Android string resource format requiring

After the resource file gets parsed as XML, the Android resource compiler requires

mihnita · 2023-10-27T16:10:51Z

exploration/text-vs-code.md

+Rarely do messages that need to include leading or trailing whitespace do so due to
+how they will be concatenated with other text,


somewhat biased word "Rarely" with something more neutral such as "Some messages need..."

In all fairness, 0.3% is indeed rare.
True, that number comes from an HTML oriented corpus, but I don't have access to much code using Windows / MacOS native formats.

mihnita · 2023-10-27T16:24:01Z

exploration/text-vs-code.md

+Also importantly, we cannot make assumptions about the validity of leading or trailing whitespace in a message,
+especially since their usage may be entirely unrelated to internationalization issues (ex: sentence agreement disruption by concatenation).


I would change this whole point a bit. Something along these lines:

All common OSes (Windows, MacOS, Linux, iOS, Android) have "plain text" widgets and "rich formatting widgets" (usually Web).
The "Web widgets" usually drag with them a whole HTML engine. That is slow, and memory consuming.
So the most commonly used widgets are plain text.
And when that is all you have, spaces and newlines are used to create "fake" formatting.
Things like paragraphs, indents, lists (bulleted or numeric).

Some examples (pick and choose):

https://petri-media.s3.amazonaws.com/2021/05/Figure10-3.png

https://docs.oracle.com/cd/E19957-01/817-4220/images/SetupWizWelcome2.gif

https://www.manageengine.com/products/support-center/help/installationguide/images/installwizard.jpg

Even in HTML there are sometimes reasons to force whitespace to be preserved.

TLDR: trailing spaces are not necessarily an i18n bug, so it is not the job of MF2 to discourage them, or to get in the way.

mihnita · 2023-10-27T16:30:53Z

exploration/text-vs-code.md

-Messages themselves are "simple strings" and must be considered to be a single
-line of text. In many containing formats, newlines will be represented as the local
-equivalent of `\n`.
+Messages themselves are "simple strings" and must be considered to be WYSIWYG.


True. But the storage format might remove spaces / newlines, and not only from the beginning.

For example, I can do this in a properties file:

myMessage = {{\
    input {$var :number}\
    {{You have {$var} message(s)}}\
    }}

What MF2 sees (once loaded from the file) is a single line, with leading spaces trimmed (from each line) and newlines removed:

{{input {$var :number}{{You have {$var} message(s)}}}}

mihnita · 2023-10-27T16:32:14Z

exploration/text-vs-code.md

+("Simple messages" refers to messages consisting solely of a pattern, and thus are not complex messages.)
+
+Because the simple message pattern consists of the entire message,
+the pattern includes any leading or trailing whitespace.


and newlines

Already included.

s = 1*( SP / HTAB / CR / LF )

mihnita · 2023-10-27T16:45:36Z

exploration/text-vs-code.md

+when there is 1+ declarations in a `match` (selection) message,
+or when there are 2+ declarations in a non-`match` complex message.
+
+Cons:


About closing brackets, pros and cons:

Closing brackets are ingrained in developers.
{{ something something feels broken because of the missing }}

The closing brackets might assist in some storage formats (maybe to be designed), especially some that might be minimized by tools.
Example:

msg1 = {{ some complex message }}
when = Type your name here.

Minimized: msg1={{ some complex message }}when=Type your name here.

One can use the space after closing to add comments, metadata for linters or other tools.

{{ ... when ... }} lint_rules: { maxlen:"80 chars" } ref : { screenshot: "https://example.com/foo.jpg", glossaryId: 1234 }

Not strong arguments. But there is some value.
Even if all it does is prevent the "what the heck is this" reaction.

mihnita · 2023-10-27T16:55:48Z

exploration/text-vs-code.md

+when there is 1+ declarations in a `match` (selection) message,
+or when there are 2+ declarations in a non-`match` complex message.
+
+Cons:


Addition to the pros (new bullet or changing an existing one): visibility (?)

"No trimming / Always delimit" also makes it clear what spaces are rendered.
Example:

{#when one} This is a message (when one) condition

vs

{{ when one { This is a message (when one) condition } }}

In the second case it is clear what is rendered: leading spaces, no matter what kind.

In the first case it is not clear.
"We trim the ASCII spaces" rule does not help, since the spaces might be non-breaking spaces, or em-space, en-space, ideographic space, and all the other characters that look like space on screen, but are not ASCII space.

So visually I don't know where the message starts / stops.
Even in edit mode (when I translate) I don't know.

mihnita · 2023-10-27T17:12:09Z

exploration/text-vs-code.md

+when there is 1+ declarations in a `match` (selection) message,
+or when there are 2+ declarations in a non-`match` complex message.
+
+Cons:


A bit worse: as a translator you often get strings with no context.
There is no way to know what kind of API the string will go through.

So if I get this string "Bill Gates did something" and I have to translate, and I want to put the "honorific space" in front of the name, I don't know how to do it.
Is the message going through MF2 with trimming? Then I have to wrap it in { Bill ....}
But maybe it is not going through MF2.
Then if I wrap it, the { ... } will render on screen "as is", which is not what you want.

We've seen this:
"The bee" => "L'abeille" : if the message goes through MF1, the apostrophe needs to be escaped
"I'm 1 in 100" => "Eu sunt 1%" : if the string goes through a printf-like API then I need to escape the %

And as a translator I have no way to know what API the dev uses, and I'm not familiar with the escape rules for myriads of APIs.
I am in fact faced with more APIs than a developer.
A dev might do "java + html + js".
A translator often works on many projects, from many companies, so they are exposed to strings consumed by native Windows apps, PHP, some Ruby stuff, C#, SQL, others. And they may even switch several times per day.

These days you don't pay the bills as a translator if you only handle one single format for one product from one company.

Co-authored-by: Addison Phillips <[email protected]>

exploration/text-vs-code.md

echeran · 2023-10-27T21:22:03Z

exploration/text-vs-code.md

+Within a complex message, patterns are always quoted with `{{...}}` or other choice of delimiter.
+
+The entire complex message is also wrapped with `{{...}}` or other choice of delimiter.
+This allows interior "code mode" of message to have flexible whitespace in between tokens


I would like to initially say anything outside the complex message-wide delimiters is an error (invalid). In the future, if we want to relax requirements and say annotations & message description notes are allowed in message syntax, you have the freedom to do so (after the closing delimiter).

In general with API design, you can always relax requirements and narrow outputs, but the reverse is not possible (causes breaking changes).

echeran · 2023-10-27T21:41:46Z

exploration/text-vs-code.md

+* The rule about the whether leading and trailing whitespace is included is simple and unambiguous.
+* This matches the WYSIWIG behavior that simple messages preserve.
+* The patterns can be detected within the pattern more easily due to the delimiters serving as a visual anchor.
+* Requiring all patterns to be quoted minimizes the number of characters that need to be escaped within a pattern to 3:


I think this also holds for at least some of the other proposals.

You're right, it's true for 1a and 2a. Not for 3a. Since it's not a universal aspect of the proposals, it's not a wash (redundant), and thus a point worth mentioning.

I'm open to suggestions for rephrasing "Requiring all patterns" to whatever it is that results in the minimum possible number of characters needing escaping.

echeran · 2023-10-27T21:49:49Z

exploration/text-vs-code.md

+when there is 1+ declarations in a `match` (selection) message,
+or when there are 2+ declarations in a non-`match` complex message.
+
+Cons:


+1 to @mihnita 's point about the possibility "to add comments, metadata for linters or other tools."

We can discuss as a workgroup whether we leave the space after the closing delimiter as a free-for-all, or we reserve it for us as a standard for future extensions.

Note: we are losing the possibility to do the same for simple messages because we are moving away from our current syntax.

echeran · 2023-10-27T22:58:21Z

exploration/text-vs-code.md

+while complex messages use the aforementioned delimiter to quote patterns (ex: `{{...}}`).
+* Another potential drawback, specifically in the case of non-`match` complex messages with exactly 1 declaration,
+is that this option adds 2 extra delimiters compared to an alternative syntax that doesn't require quoted patterns
+and is designed to minimize delimiter usage only to code mode introducers.


If the code block is delimited from both sides, users may be tempted to insert text around it.

Not a moving argument to me. As you pointed out in a different comment, the audience of message authoring is developers. If we say it's not possible to add extraneous text, and they do so and the MF2 implementation rejects the message, they'll figure it out quickly.

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous. Also, even though developers learn instincts early on to always balance delimiters, we create linters & other tooling to double-check.

(Some languages that use delimiters in a simple and regular way have the ability to evolve powerful tools that always keep everything balanced while being easy to use -> easy to do the right thing, impossible to do the wrong thing. But I digress...)

If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

Sure. Option 3a solves for that with the tradeoff of taking on other costs as a result. The next question then becomes how do we prioritize our requirements in order to create a value system that we use to evaluate the tradeoffs (choose)?

exploration/text-vs-code.md

eemeli · 2023-10-29T13:28:14Z

On my part, I've spent much more time than I'd wish over the past few days and weeks thinking about and looking into messages with external whitespace. In the absence of any better place to put down some of my thoughts on this, these are the aspects and arguments that I find important to account for:

Localizable external whitespace is really rare, while mistakes are common

Sometimes a leading or trailing space could be localizable, but you can't necessarily tell without looking through the code which is using the message. So I did that. This is what I'd mentioned previously via email:

In total, [Mozilla's Pontoon] system has so far handled about 175k translatable messages, of which 568 had exterior whitespace in the source locale. These I've manually categorised as follows, starting from the most common:

180 incorrect segmentation / string concatenation, e.g. " for this version." or "Download the "

122 contents wrapped in a <tag>, so the whitespace is almost certainly an insignificant segmentation artifact.

108 ends in colon+space, such as "Description: "

106 unlocalized markup, such as leading or trailing newlines

52 potentially localizable space between clauses

So overall that's perhaps 0.3% of all messages, of which (generously) up to a third might be localizable. If you'd like to perform your own analysis, please reach out to me separately for the source data.
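The arithmetic behind these figures can be rechecked with a short Python sketch. The category counts are the ones listed above; `total_messages` is an assumption that takes "about 175k" as exactly 175,000, so the resulting share is only approximate.

```python
# Category counts as reported in the comment above.
categories = {
    "incorrect segmentation / string concatenation": 180,
    "contents wrapped in a <tag>": 122,
    "ends in colon+space": 108,
    "unlocalized markup": 106,
    "potentially localizable space between clauses": 52,
}

# Assumption: "about 175k" is treated as exactly 175,000.
total_messages = 175_000

with_exterior_ws = sum(categories.values())
share_of_all = with_exterior_ws / total_messages

print(f"messages with exterior whitespace: {with_exterior_ws}")
print(f"share of all messages: {share_of_all:.2%}")
for label, count in categories.items():
    # Each category as a fraction of the messages with exterior whitespace.
    print(f"{count:4d}  {count / with_exterior_ws:6.1%}  {label}")
```

The five categories sum to 568, matching the stated total, and the share comes out near 0.3%. The narrowest category (52 potentially localizable messages) is about 9% of the 568; the comment does not say which categories make up the generous "up to a third".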

As a next step, I filtered all of the above to the 41 potentially localizable strings which are currently in production (looks like a sentence, has one leading or trailing space), and found where they're coming from in code, and how they're used. (The original comment links each of the 41 strings individually, numbered 0 through 40.)

Of the above, only the 0th message actually contains a localizable space; the other 40 are all bugs. So that's exactly one localizable external space in about 66k messages currently in production. This space was incorrectly dropped in 15 of the 35 locales to which it's translated; I've now submitted corrections for all of them.

Real localizable external whitespace is so rare that it gets drowned by the noise.

All of the above bugs are in formats that explicitly delimit patterns, and thereby make it too easy to include leading or trailing whitespace. About 28% of the messages currently in production use Fluent, which does not quote patterns. None of these were similarly buggy.

From this I would conclude that using a syntax which requires external whitespace to be explicitly intentional would make it much less likely for it to be ignored, and would lead to better localizations.

We're not actually talking about quoting patterns

To be precise, we are talking about delimiting patterns with {braces}, which rather explicitly are not 'quotes' or "quotes". The distinction here matters, because we are looking to assign a novel meaning to a pair of characters that no other syntax than ICU MessageFormat uses to delimit localizable text. Every other syntax which uses braces in text uses them to delimit code. Which MF2 is also doing.

When we talk of the syntax being "WYSIWYG", we are asking for its readers to not see the {braces}, a symbol so prevalent in our syntax that we've jokingly incorporated it into our logo {�}. Humans are not trained for that the way they are with "quotes", or with empty spaces acting as content separators. In other words, if I see the braces but I don't see the empty space, how is the syntax "WYSIWYG"?

We are also asking for MF2 authors and editors to somehow know that the spaces within the { braces } are significant -- but only if they're delimiting patterns. Within expressions, MF2 syntax ignores whitespace, so {{ Hello {world }}} formats to contain a leading space, but no trailing space. I mean, go back to the first sentence of this paragraph and consider how you saw the " braces ": Was it truly obvious that its padding spaces were a part of the string it represents the same way they are with the double quotes?
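To illustrate the asymmetry described above, here is a toy Python sketch. It is not an MF2 parser: `format_delimited` is a hypothetical helper that only understands a `{{...}}` wrapper and flat `{...}` placeholders, and it assumes that a bare placeholder body is a literal while a `$name` body is a variable lookup.

```python
import re


def format_delimited(message, args=None):
    """Toy formatter for a {{delimited}} pattern (hypothetical, not an MF2 parser)."""
    args = args or {}
    if not (message.startswith("{{") and message.endswith("}}")):
        raise ValueError("expected a {{delimited}} pattern")
    # Everything between the delimiters is pattern text, whitespace included.
    pattern = message[2:-2]

    def expand(match):
        # Whitespace inside a placeholder is ignored.
        body = match.group(1).strip()
        if body.startswith("$"):
            return str(args[body[1:]])
        return body  # assumption: a bare body is treated as a literal

    return re.sub(r"\{([^{}]*)\}", expand, pattern)


result = format_delimited("{{ Hello {world }}}")
print(repr(result))  # ' Hello world'
```

The leading space survives because it sits in pattern text, while the space after `world` disappears because it sits inside the placeholder, even though both are bounded by the same brace characters.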

The overlap of external whitespace and variants is really truly tiny

In a very real sense, the discussion of whether messages with variants should always be delimited is a discussion asking if this string is ok being represented as

{#match :platform}
{#when macos} ⇧ ⌘{| |}
{#when *} Ctrl+Shift+

or if that is sufficiently problematic that we need every pattern of every message with variants to be {{delimited}}.

I have been actively looking for examples of messages with leading or trailing whitespace for the last month, and the above is the one actual, current message with variants and external whitespace that has been identified. We are talking about such a rare situation that it should not be driving our whole syntax.

We should choose to do with patterns what we're doing with literals, where for common values we allow them to be delimited by whitespace, but also permit |vertical pipes| to be used as "quotes" in the rare cases where they're required.
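A minimal sketch of how this could behave, in Python. Everything here is illustrative rather than specified: `unquoted_variant_pattern` is a hypothetical helper, the set of trimmed characters is an assumption about what counts as MF2 whitespace, and the `{|...|}` handling only covers a literal with no nested pipes.

```python
import re

# Assumption: MF2 syntax whitespace is space, tab, CR, and LF.
MF2_WHITESPACE = " \t\r\n"


def unquoted_variant_pattern(raw):
    """Toy reading of an undelimited variant pattern (hypothetical helper)."""
    # Exterior whitespace is not part of the pattern unless made explicit.
    trimmed = raw.strip(MF2_WHITESPACE)
    # A {|...|} literal placeholder contributes its contents verbatim.
    return re.sub(r"\{\|([^|]*)\|\}", lambda m: m.group(1), trimmed)


macos = unquoted_variant_pattern(" ⇧ ⌘{| |}")
other = unquoted_variant_pattern(" Ctrl+Shift+")
print(repr(macos), repr(other))
```

Under these assumptions the `macos` variant keeps its trailing space only because the author wrote it explicitly, while the incidental space after each `{#when ...}` key is dropped.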

Fast tracked from #503.

aphillips · 2023-10-29T15:32:10Z

@eemeli Thanks for the long comment.

To be precise, we are talking about delimiting patterns with {braces}, which rather explicitly are not 'quotes' or "quotes".

I agree that we are talking about delimiting patterns and that this is what our technical decision is about. Quoting would be one mechanism for delimiting patterns, but it is not the only one. I would call out that quoting does not require the use of "quote" characters. I think that referring to {{ and }} as pattern quotes is fine (although we could be pedantic and call them "pattern delimiters" if you prefer).

The overlap of external whitespace and variants is really truly tiny

I think this isn't quite on the nose. The real problem we're dealing with here is intentionality. There has to be a way for users to intentionally include various kinds of character sequences in their pattern. This includes invisible Unicode whitespace that is not MF2 whitespace. For example, non-breaking space or NNBSP or ZWNJ or what have you. There are a lot of characters that have no "ink" but which a user might intend to be part of the message. Some of those characters will be MF2 whitespace.

MF2 intentionally does not include a general purpose character escaping mechanism (because we expect the host environment or file format to include one and we are avoiding the double-escaping mess). If the boundary between "pattern" and "not pattern" is between invisible characters, that's pretty difficult to work with.

I think there are four audiences that need to be served:

Developers/message authors. The creator of the source message needs to understand how the pattern is delimited and be able to clearly include any character sequence into a pattern.
Translators. Translators, like developers, need to know whether whitespace is part of the pattern or not and to intentionally include whitespace into patterns where necessary.
MF2 parsers. The parser obviously needs to be able to determine the boundary of the pattern if the message is to be rendered.
Tools. CAT tools could treat the entire message as a single segment. But they could also produce separate segments for each when case (including generating necessary additional segments for the target locale). Either way, tools generally protect syntax to allow translators to focus on content. Non-CAT tools also need to preserve intentional whitespace and not interfere with pattern delimiters.

What I think is interesting is that pattern delimiters are probably syntax. If pattern delimiters are optional, it might be unclear to translators whether a given pattern is already quoted (delimited) or to tools as to whether delimiters are needed. It's hard for machines to guess people's intentions. I agree that PEWS is rare, which is why we need to be especially clear about how to handle the rare cases.

eemeli · 2023-10-29T18:19:56Z

[...] I think that referring to {{ and }} as pattern quotes is fine (although we could be pedantic and call them "pattern delimiters" if you prefer)

I don't want to insist on pedantry, as long as we recognise that the pattern "quote" characters we're considering are rather explicitly not generally used as quote characters. For anyone coming to MF2 not via MF1, this will be an additional weird thing to learn. Very much comparable to |quoted literals|, which I've heard argued not to be so bad because they'll be so rare.

[...] There has to be a way for users to intentionally include various kinds of character sequence into their pattern. This includes invisible Unicode whitespace that is not MF2 whitespace. For example, non-breaking space or NNBSP or ZWNJ or what have you. There are a lot of characters that have no "ink" but which a user might intend to be part of the message. Some of those characters will be MF2 whitespace.

Yes, and this will be supported no matter which way we decide to go, by the optional {{pattern quoting}} ability. Which unfortunately will mean that the closing brace } may need to be considered a syntax character, and require special escaping. Pontoon has seen 17 messages with } as a pattern character, an order of magnitude more than messages with localizable external whitespace.

…s, prev info

echeran · 2023-10-30T04:10:07Z

Ready for review now. I made edits mostly based on #504, with some additions, and also responded to comments that came in before the PR was ready. Previous info from the doc (background, use cases, stats, examples) is kept pre-hidden at the bottom, like appendices.

exploration/text-vs-code.md

eemeli · 2023-10-30T07:58:27Z

exploration/text-vs-code.md

+>{{
+>   match {$var}
+>   {when *} This pattern has a space in front (it's between \} and This)
+>   {when other}
+>      This pattern has a newline and six spaces in front of it
+>   {when moo}This pattern has no spaces in front of it, but an invisible space at the end
+>}}


Suggested change

>{match {$var}}
>{when *} This pattern has a space in front (it's between } and This)
>{when other}
>      This pattern has a newline and six spaces in front of it
>{when moo}This pattern has no spaces in front of it, but an invisible space at the end

eemeli · 2023-10-30T08:00:05Z

exploration/text-vs-code.md

+
+Pros:
+- WYSIWYG (on steroids)
+


Suggested change

- Avoids as many escape sequences as possible,
  as `}` does not need escaping in patterns.

eemeli · 2023-10-30T08:03:36Z

exploration/text-vs-code.md

+- Probably not a serious alternative: the example
+  includes any number of obvious footguns that have to be addressed


This seems really rather opinionated. In many ways, this is the same as the "Always quote" solution, except that the pattern delimiters are }…{ instead of {{…}}. So the unnamed footguns probably apply to that alternative as well.

It is opinionated. When I wrote it, I was being lazy by not enumerating the issues.

this is the same as the "Always quote" solution, except that the pattern delimiters are }…{ instead of {{…}}. So the unnamed footguns probably apply to that alternative as well.

No, this is incorrect.

{{...}} encloses all and only the whitespace that is intentional in the pattern, with {{ and }} forming the pattern boundary. These boundary characters are visible.

}...{ makes all whitespace in the variant block meaningful. It effectively prohibits a multiline representation of a message, because the newlines are always meaningful. It also means that trailing spaces (which are invisible) have meaning.

To make a message multiline, you have to put the whitespace inside the key:

{{ match {$var} {when 0 }This has no newline or space.{ when one} This has a newline at the start.{ when *} This has a space at the start and six spaces and a newline at the end. }}

If we take your example above, and after the match replace each } with {{ and each { with }}, we get this:

{{ match {$var} }}when 0 {{This has no newline or space.}} when one{{ This has a newline at the start.}} when *{{ This has a space at the start and six spaces and a newline at the end. }}

Ignoring the specifics of what's happening with the preamble, that seems pretty similar to me. It's just that we're conditioned to look at the { and } a certain way.

eemeli · 2023-10-30T08:27:44Z

exploration/text-vs-code.md

+Cons:
+- Requires one of the alternate syntaxes
+- Has two ways to represent a pattern.
+- May be difficult for translators to add quotes when needed.


As far as I've been able to determine, there are exactly three scenarios in which a translator may need to add leading or trailing spaces to a pattern that starts out without them:

When translating a whole-sentence message from a CJK script to a non-CJK script, such that the sentences are concatenated into a single paragraph and need spaces between them. As with all other string concatenations, I would expect for this to be explicitly called out to the translator, so that they may know whether to add the space at the start or end of the pattern.

When translating a pattern to Chinese which ends up requiring a leading honorific space. As far as I can tell, this is really rare in dynamic message strings.

When the message is expected to be output using a monospace font and fakes either centering or right-alignment by using in-message spaces for indentation, and the first line of the pattern happens to be exactly the maximum length, and so does not need leading spaces. This is sufficiently rare that I'm pretty sure this is only a theoretical possibility, and in any case I'd expect it to be rather clearly called out to the translator.

Given that each of the above only has an impact on the pattern delimiting if the message also has multiple variants and if the translator is not using any tooling that'd take care of the delimiting and if the developer has not pre-emptively delimited the pattern, I would be ok accepting this negative, especially as the downside would be a single missing space in the translation.

eemeli · 2023-10-30T08:30:40Z

exploration/text-vs-code.md

+- Easy to use (best of both worlds?)
+
+Cons:
+- Requires one of the alternate syntaxes


Given that we just ran a "beauty contest" in which these "alternate syntaxes" were preferred over the current main syntax by an absolute majority of the participants, this could also be listed as one of the "Pros".

eemeli · 2023-10-30T08:31:37Z

exploration/text-vs-code.md

+Pros:
+- Code is special, whitespace is not.
+- Makes PEWS into a "special event", alerting developers to the non-I18N aspects of it?
+


Suggested change

- Avoids as many escape sequences as possible,
  as `}` does not need escaping in patterns.

It needs escaping if it is used as the ending of the "wrapping" of the string.

exploration/text-vs-code.md

Co-authored-by: Eemeli Aro <[email protected]>

echeran · 2023-10-30T15:42:23Z

The distinction here matters, because we are looking to assign a novel meaning to a pair of characters that no other syntax than ICU MessageFormat uses to delimit localizable text. Every other syntax which uses braces in text uses them to delimit code. Which MF2 is also doing.

Remember our previous discussions from last year about this. This MF2 group chose curly braces precisely because they are the least likely to occur in other syntaxes and in message patterns themselves.

aphillips · 2023-10-30T16:25:41Z

For the purposes of reading the document in our 2023-10-30 call, I'm merging this work now. This does not make the comments above right/wrong, relevant/irrelevant, or anything else. It's just to enable "easy reading".

* Prepare design doc ahead of balloting
  - Rename the design doc.
  - Cross out rejected options 2 and 5
  - Add notes to 2 and 5 calling this out (other changes may be added from the previous thread in #503 and WG call notes from 2023-10-30)
* Prepare balloting instructions
* Update exploration/delimiting-variant-patterns.md
  Co-authored-by: Eemeli Aro <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Eemeli Aro <[email protected]>

Co-authored-by: Eemeli Aro <[email protected]>

echeran added 14 commits October 25, 2023 21:27

typo

2ce9726

Contextualize templating libraries

4c0cfc6

Differentiate template rules from output format rules, give examples …

40ad868

…of whitespace handling rules for template languages

Update containing format escape rules interaction and assumptions abo…

a8c76cd

…ut frequency

Move usage stats from Background to Use Cases, contextualize frequenc…

506ce65

…y vs. importance vs. i18n best practices

De-emphasize overstated developer-only concern

cb16e71

Fix design tenet wording using quotes from noteworthy prior art

1417fd4

Update rest of Requirements

b7870d4

Update Constraints

2d8cf0a

Remove duplicate section not properly deleted after copied for a move

e86fa48

Update Design area (proposed, alternatives, simple message consensus)

0579b4a

Update title and objective to reflect focus of discussion

9450758

Wordsmithing and formatting

17f1d46

Update contributors list

0e169f8

echeran requested a review from mihnita October 27, 2023 01:59

aphillips requested changes Oct 27, 2023

stasm reviewed Oct 27, 2023

mihnita reviewed Oct 27, 2023

Apply suggestions from code review

2c7f977

Co-authored-by: Addison Phillips <[email protected]>

echeran commented Oct 27, 2023

Apply suggestions from code review

8e0b852

aphillips added a commit that referenced this pull request Oct 29, 2023

Fix typo

3d4b559

Fast tracked from #503.

Rewrite largely based on PR unicode-org#504; keep priorities, non-req…

cdcfdd5

…s, prev info

echeran marked this pull request as ready for review October 30, 2023 04:05

eemeli reviewed Oct 30, 2023

Useful addition

5e9b473

Co-authored-by: Eemeli Aro <[email protected]>

Update exploration/text-vs-code.md

fe5852c

Co-authored-by: Eemeli Aro <[email protected]>

aphillips merged commit a07d972 into unicode-org:main Oct 30, 2023
1 check passed

aphillips mentioned this pull request Oct 30, 2023

Should we allow unquoted variant patterns? #504

Closed

aphillips mentioned this pull request Oct 30, 2023

Prepare design doc ahead of balloting #506

Merged

eemeli mentioned this pull request Oct 31, 2023

BALLOT for "handling delimiting of patterns in complex messages" #505

Closed

aphillips mentioned this pull request Oct 31, 2023

Discussion Thread for Delimiting Pattern Boundaries #507

Closed

