Discussion Thread for Delimiting Pattern Boundaries #507

Closed
aphillips opened this issue Oct 31, 2023 · 59 comments

Comments

@aphillips
Member

aphillips commented Oct 31, 2023

We are currently balloting this issue in #505. Use this issue for any technical discussion or questions.

Some useful links:

@aphillips
Member Author

@sffc wrote:


Another strong "pro" of option 1 is that it should allow us to elide the quotations around code mode if we introduce a sigil. For example:

Simple text mode message:

Hello {$user}!

Code mode message, identified due to the leading sigil:

#input {$num :number}
{This is the {$num} pattern}

More complex code mode message, also identified due to the leading sigil:

#match {$num :plural}
when 0   {This is the zero pattern}
when one {This is the {$num} pattern}
when *   {These are the {$num} patterns}

With optional quoting (option 3), # becomes a character that needs escaping. With required quoting, # needs to be escaped only when it is the first character of a message.

@gibson042
Collaborator

gibson042 commented Oct 31, 2023

With optional quoting (option 3), # becomes a character that needs escaping. With required quoting, # needs to be escaped only when it is the first character of a message.

I don't think I understand what you mean here; how is privileging # in initial position superior to using {{?

@aphillips
Member Author

@gibson042 (noting that the comment is actually Shane's and I just moved it here)

I think @sffc is thinking of the 2a syntax variant that has an opening sigil instead of enclosing sigils. If we changed from enclosing {{...}}, that would make messages cleaner and remove the need for closing }} for the message.

Compare the #match version above to current:

{{
match {$num :plural}
when 0   {{This is the zero pattern}}
when one {{This is the {$num} pattern}}
when *   {{These are the {$num} patterns}}
}} <-- these are superfluous but it feels wrong to omit

We imposed the deliberately less attractive {{...}} as part of the agreement to adopt 2a-but-ugly in the 2023-10-23 call. But I would be unsurprised if, after the balloting, we removed any remaining ugliness from 2a (if we keep the syntax as it is). This would be one candidate for doing that. Note that we might use a double-sigil to make the need for escaping even rarer.

(For those unfamiliar with recent MF2 history, "2a" does not refer to balloting options, but to syntax options in a previous call for consensus)

@gibson042
Collaborator

gibson042 commented Oct 31, 2023

Use this issue for any technical discussion or questions.

I think the below commentary falls under this umbrella, but if not then feel free to instead spawn a new issue.

Accepting simple messages as defined in Delimiting of Patterns in Complex Messages requires starting in "text mode" rather than "code mode" (cf. Syntax Variation Beauty Contest), and therefore looking for declarations and/or selectors to identify a complex message. Declarations in particular will often be followed by line feed, as they are in beauty contest examples. And it's not captured in the latest requirements, but I feel strongly that iterative updates to a message which change its classification (e.g., adding or removing declarations) should not affect treatment of leading or trailing whitespace—cf. #474 (comment)

 logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
 ```;

The relevance to this discussion is that satisfying that requirement would rule out designs in which leading/trailing whitespace is preserved in a simple message but trimmed in [selector-free] complex message patterns or vice versa. I have separate concerns about preserving leading/trailing whitespace at all (which is hostile to using a line-feed-terminated plain text file as the container format)[1], but AFAICT that's out of scope for #505.

Footnotes

  1. EDIT: This was ultimately raised as Complex message start syntax is too strict #610.

@gibson042
Collaborator

P.S. Option 1 requires quoting even for selector-free complex message patterns, right? And if so, would quoting be permitted in a simple message or would the above diff require touching both declarations and pattern quoting?

logOutMessage = ```
-{%input username}
-
-{{Log out {$username}?}}
+Log out?
```;

@mihnita
Collaborator

mihnita commented Oct 31, 2023

Here are some statistics from leading spaces in the Android resources:

languages     file_count  string_count  string_with_leading  percent
all lang           44585       2205156                 4644  ~ 0.21%
English only        2369         60384                   55  ~ 0.09%
Amharic only         407         25352                   87  ~ 0.34%

This was done only on the strings.xml files, and only on <string> tags, not <string-array> or <plurals>

It is interesting that Amharic has about 3.77 times more strings with leading spaces than English.
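
As a quick check of the figures above, the percentages and the Amharic-to-English ratio can be recomputed from the raw counts. This is only an arithmetic sketch; the object layout and the names `stats`, `leadingPercent`, and `amharicToEnglish` are illustrative and not part of the original analysis.

```javascript
// Raw counts copied from the table above.
const stats = {
    all: { strings: 2205156, leading: 4644 },
    english: { strings: 60384, leading: 55 },
    amharic: { strings: 25352, leading: 87 },
};

// Share of strings that start with a space, as a percentage.
function leadingPercent({ strings, leading }) {
    return (leading / strings) * 100;
}

const percents = Object.fromEntries(
    Object.entries(stats).map(([lang, row]) => [lang, leadingPercent(row)])
);

// How much more often Amharic strings start with a space than English ones.
const amharicToEnglish = percents.amharic / percents.english;

console.log(
    percents.all.toFixed(2),
    percents.english.toFixed(2),
    percents.amharic.toFixed(2)
); // 0.21 0.09 0.34
console.log(amharicToEnglish.toFixed(2)); // 3.77
```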

I don't know if the reason is this or not:
https://www.w3.org/TR/elreq/#first_paragraph

But the validation used for our localization checks for, and reports, inconsistencies in leading / trailing spaces between the source (English) and target languages.

The Amharic translators added a leading space.
They ignored the warning from the checker.
And somehow this only happens for Amharic translators, not for other languages.

It is a real phenomenon, and does not look like an accident.

So we might accidentally disadvantage some languages because we think we know better.
But translators know better, our job is to make their life easier and not get in the way.

As Addison said yesterday, it is a big world out there.
We don't know everything about all languages.

@mihnita
Collaborator

mihnita commented Oct 31, 2023

We should also be aware of the "chicken and egg" problem.

Some years ago the UTC was arguing that Romanian uses both S and T with cedilla below, and S & T with comma below. Pointing to websites and many new books.

But at the time the most common OS was Windows 95, which was not Unicode. The S & T with cedilla below were present in the Eastern-European code page (cp-1250), S & T with comma below were Unicode only (didn't exist in a Windows supported codepage).
So (of course) all websites, and new books published with DTP software (that one also with a Latin 1 bias) got this wrong.

Looking at limited, existing corpus tells us nothing about what is needed, only tells us about current status.


To this day doing HTML in Chinese / Japanese, Khmer, Thai (and other languages that don't use spaces) is harder than it is in other languages.

In English (and most other languages) you can wrap the text, and HTML rendering will replace it with a space.

But if you do that in languages above you end up with spaces in the middle of the text, where they should not be.
So they are basically not allowed to wrap the text in HTML (or get bad results).
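
A simplified model of the problem, assuming the renderer turns every run of whitespace (including line breaks) into a single space. Real HTML/CSS white-space processing has more rules than this, so `collapseWhitespace` is a hypothetical illustration, not a description of any particular browser.

```javascript
// Simplified assumption: any run of spaces, tabs, or line breaks becomes one space.
function collapseWhitespace(source) {
    return source.replace(/[ \t\r\n]+/g, " ");
}

// Source text wrapped across two lines by the author.
const english = "Hello\nworld";
const japanese = "こんにちは\n世界";

console.log(collapseWhitespace(english));  // "Hello world" (the space is wanted)
console.log(collapseWhitespace(japanese)); // "こんにちは 世界" (the space is not wanted)
```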

Another example of technology disadvantaging certain languages.

Do we want that for MF2, by design?

The best policy for localization is "don't mess with my text!"
No fancy rules, no escaping, trimming, etc.

@echeran
Collaborator

echeran commented Oct 31, 2023

I think @sffc is thinking of the 2a syntax variant that has an opening sigil instead of enclosing sigils. If we changed from enclosing {{...}}, that would make messages cleaner and remove the need for closing }} for the message.

I would be okay with this solution. It seems like a compromise where we always delimit the patterns, and we also avoid having more levels of nested delimiters than people find manageable or tasteful. The key point we're voting on in #505 is about whether we always delimit/quote patterns or not, and I would be okay with this compromise if it would require it.

@mihnita

It is a real phenomenon, and does not look like an accident.

So we might accidentally disadvantage some languages because we think we know better. But translators know better, our job is to make their life easier and not get in the way.

As Addison said yesterday, it is a big world out there. We don't know everything about all languages.

@gibson042

I have separate concerns about preserving leading/trailing whitespace at all (which is hostile to using a line-feed-terminated plain text file as the container format), but AFAICT that's out of scope for #505.

If it is a choice between disadvantaging a language/group of people vs. disadvantaging a file format, I will pick being fair to all people & languages equally if it inconveniences a file format a little bit. As an i18n API, I think it behooves us to prioritize doing things fairly and consistently for all languages & people, where possible. I'm not sure of the context of the Amharic usage patterns, but regardless, I don't want to make assumptions that could affect how people use their language, for the cases that we happen to know now, and the ones that we're not yet aware of.

This is very much part of the reasoning behind Option 1 in #505, so it seems in scope.

@aphillips
Member Author

@gibson042 asked:

P.S. Option 1 requires quoting even for selector-free complex message patterns, right?

Option 1 requires that the pattern(s) be quoted in all complex messages, including those with only declarations.

And if so, would quoting be permitted in a simple message or would the above diff require touching both declarations and pattern quoting?

Patterns are currently allowed by the ABNF to consist solely of a quoted pattern (in code mode).

{{{{Log out?}}}}

Adopting a code-mode sigil makes this more attractive:

#{{Log out?}}

Simple patterns are not trimmed. In Option 1, quoted patterns are not trimmed internally either (but the quotes have to be added). Options 3 and 4 both also require pattern quoting (of some sort) to avoid changing the trim state of the pattern. Option 4 might require whitespace quoting in simple patterns (as a somewhat obvious extension)

@gibson042
Collaborator

Patterns are currently allowed by the ABNF to consist solely of a quoted pattern (in code mode).

{{{{Log out?}}}}

So messages {{{{Log out?}}}} and {{Log out?}} would be equivalent? Seems fishy... 🤔

If it is a choice between disadvantaging a language/group of people vs. disadvantaging a file format, I will pick being fair to all people & languages equally if it inconveniences a file format a little bit. As an i18n API, I think it behooves us to prioritize doing things fairly and consistently for all languages & people, where possible. I'm not sure of the context of the Amharic usage patterns, but regardless, I don't want to make assumptions that could affect how people use their language, for the cases that we happen to know now, and the ones that we're not yet aware of.

This is very much part of the reasoning behind Option 1 in #505, so it seems in scope.

Continuing discussion on the basis of that final assertion...
The inconvenience is more than "a little bit"... almost every text editor, including those that are explicitly i18n friendly, defaults to ending files with a line feed or other terminator, and tools such as git actively complain about the absence of such. If those characters are treated as meaningful content, then users are going to experience problems and developers are going to experience frustration fixing them (whether or not translators would also be affected is unclear to me). Properly addressing that would require quoting all patterns (including in simple messages), which looks a lot like just starting in code mode (as opposed to addressing it improperly by e.g. stripping final line terminators—are those \n, \r, and/or \r\n?).

Option 1 requires that the pattern(s) be quoted in all complex messages, including those with only declarations.

In addition to the above issue, that creates iteration hazards: adding a declaration to an unquoted simple message must be preceded by quoting it, but adding a declaration to a quoted simple message must not make such content changes (and therefore the agent making the change must distinguish between quoted vs. unquoted simple messages).

Option 1 requiring quotes around patterns in complex messages but not in simple messages changes my ranking of it from a mere dispreference to an objection. From my perspective, unquoted simple messages are only coherent when interpreted as unquoted patterns subject to trimming, because of the way that declarations manifest as in-band insertions.

"Simple message" vs. "complex message" is workable (if unnecessary) taxonomy complexity, but "unquoted simple message" vs. "quoted simple message" vs. "complex message" is a bridge too far.

@aphillips
Member Author

aphillips commented Oct 31, 2023

@gibson042

So messages {{{{Log out?}}}} and {{Log out?}} would be equivalent? Seems fishy... 🤔

No. Messages like {{{{Log out?}}}} and Log out? are equivalent.

(and therefore the agent making the change must distinguish between quoted vs. unquoted simple messages).

Simple messages are unquoted patterns by definition. That is {{Log out?}} is a syntax error. (Don't forget that the ABNF exists...)


(chair hat on)

changes my ranking of it from a mere dispreference to an objection

If you wish to change your vote, please do so by editing your existing vote comment. It is permitted to omit an item from your ranking.

@mihnita
Collaborator

mihnita commented Oct 31, 2023

I am not sure that {{{{Log out?}}}} is legal today, in code or not.
And even if it is, I think this is something we can cleanup.

Bringing it up here can only muddy the waters and change people's votes.
I am quite sure all proposed options can be abused to make them look horrible.

@eemeli
Collaborator

eemeli commented Oct 31, 2023

I'd say that noting that the patterns {$foo} and {$bar} and {{{{{$foo} and {$bar}}}}} may be equivalent with an option 1 syntax is appropriate, given that the explainer doc specifically calls out for option 3 that it "Has two ways to represent a pattern."

@mihnita
Collaborator

mihnita commented Oct 31, 2023

Again, I don't think {{{{{$foo} and {$bar}}}}} is legal.
And if it is, it is not the required form, it is the abused form.

So trying to put that against option 3 (which is "oh so clean") is just trying to misrepresent things to scare people into voting against 1.
Same as forcing {{...}} on the option instead of {...}, to make it "uglier"

@eemeli
Collaborator

eemeli commented Oct 31, 2023

Are you proposing that if option 1 is selected, we should add a new data model error to the spec? Something along the lines of:

A complex message MUST include at least one declaration or a matcher.
Otherwise, it is not considered valid.

Or were you thinking of making it a syntax error?

@sffc
Member

sffc commented Oct 31, 2023

My thinking was that we keep the current Option 0 code mode syntax for everything, except if the first non-whitespace character is not { or #, then we interpret the message as a single text mode message. So, like, you can write Hello {$world}! (elided braces) or {Hello {$world}!} (explicit braces) and those are both fine as an entire message.

In other words, we introduce a very narrow condition when the {} can be elided on simple messages, and that narrow condition covers the large majority of messages. For messages with declarations or matchers, just use the Option 0 syntax that we've already agreed upon, and make sure that the first keyword starts with a #.

@aphillips
Member Author

@sffc That was more or less my thinking, although I would probably require the code-mode sigil to get to a quoted pattern. If we allow quoted-pattern to start, that's a lot of parsing states (for the machine and for users). The doubled-brackets have grown on me as pattern quotes since starting to fool with them.

I think changing the code mode sigil from paired bracket sets to something like # improves option 1 to me in a true complex message. I really like the cleanliness of the (unwrapped) statements being followed by clearly quoted patterns.

@gibson042 has a point that some folks will force the complex wrapper on simple messages and for that option 1 is less pretty, even with reduced syntactical sugaring:

#{Hello {$world}!}   // option 1 can be written this way or like...
Hello {$world}!      // option 3 and 4

Allowing quoted-pattern into simple messages makes that:

#{Hello {$world}!}   // option 1 can be written this way
{{Hello {$world}!}}  // option 3 can be written this way
Hello {$world}!      // All three can be written this way (4 can only be written this way)

Option 4 has some appeal in that there is only one recipe for pattern and that the whitespace we're all discussing requires some thought (intentionality). PEWS (and only PEWS) is never an accident in 4.

@duerst

duerst commented Nov 1, 2023

@mihnita

It is a real phenomenon, and does not look like an accident.
So we might accidentally disadvantage some languages because we think we know better. But translators know better, our job is to make their life easier and not get in the way.
As Addison said yesterday, it is a big world out there. We don't know everything about all languages.

@gibson042

I have separate concerns about preserving leading/trailing whitespace at all (which is hostile to using a line-feed-terminated plain text file as the container format), but AFAICT that's out of scope for #505.

If it is a choice between disadvantaging a language/group of people vs. disadvantaging a file format, I will pick being fair to all people & languages equally if it inconveniences a file format a little bit. As an i18n API, I think it behooves us to prioritize doing things fairly and consistently for all languages & people, where possible. I'm not sure of the context of the Amharic usage patterns, but regardless, I don't want to make assumptions that could affect how people use their language, for the cases that we happen to know now, and the ones that we're not yet aware of.

I very much agree with the general feeling!

But the case of Amharic, at least with the numbers above (0.34%, vs. 0.21% average and 0.09% for English) doesn't really justify anything. If Amharic translators have to do 0.13% more work for spaces than the average, that's completely negligible given that in many languages, text length or word count expands by 10% or more on translation. And that's counting adding the spaces and surrounding the syntax at the same weight as the translation itself, which is surely an overestimation.

@gibson042 gibson042 changed the title DIscussion Thread for Delimiting Pattern Boundaries Discussion Thread for Delimiting Pattern Boundaries Nov 1, 2023
@gibson042
Collaborator

If you wish to change your vote, please do so by editing your existing vote comment. It is permitted to omit an item from your ranking.

Done, but the new ranking is actually contingent on support for untrimmed unquoted simple messages. If Option 1 truly were "always quote patterns", including in simple messages, then I would consider it acceptable.

@aphillips
Member Author

@gibson042

Done, but the new ranking is actually contingent on support for untrimmed unquoted simple messages. If Option 1 truly were "always quote patterns", including in simple messages, then I would consider it acceptable.

(chair hat off)

I'm curious about your reasoning. "Always quote patterns" would be our old syntax, which always starts in code mode. Can you explain why you feel it's better to always quote the pattern?

Unquoted simple patterns seem really straightforward to use and meet the majority of message cases.

@gibson042
Collaborator

gibson042 commented Nov 1, 2023

Full reasoning is explained in #507 (comment) , but the gist of it is that I don't feel it's better to always quote the pattern—rather that the combination of untrimmed unquoted simple messages along with required quoting in complex messages is incoherent and hazardous to message iteration (in part because a message must first be classified as simple vs. complex by examining its contents).

My top preference is supporting and trimming unquoted patterns in both simple and complex messages, and I can live with a model in which patterns must always be quoted in both (preserving all contents inside the quotes), but breaking that internal consistency creates too many problems for me to feel comfortable with.

@sffc
Member

sffc commented Nov 1, 2023

@gibson042 Is your concern satisfied by my proposal where simple messages permit a single quoted pattern? Then there is no syntactic difference between simple and complex messages so long as the host has a way to normalize simple messages to be quoted.

@mihnita
Collaborator

mihnita commented Nov 1, 2023

But the case of Amharic, at least with the numbers above (0.34%, vs. 0.21% average and 0.09% for English) doesn't really justify anything.

I agree that when we look at one language / script the numbers might be small.

But over the last few weeks we've heard arguments trying to minimize the problem with spaces:

  • for languages that don't use spaces (Chinese, Japanese, Thai, Khmer, etc)
  • Amharic
  • applications using OS widgets (non-HTML applications), for all major OSes
  • command line applications
  • warnings from people who worked with localization (seeing translator complaints, triaging bugs, etc)

Requirements to escape certain characters, automated trimming, changing case, or any other kind of automatic operation on text provided by translators have been a source of many bugs.
The best policy for localizable strings is "use what the human translator gives you" (I don't know what the rules for AI content should be :-)
You can assist that human element with all kinds of validations, but taking what they give you and trying to outguess them is usually asking for problems.
And when you translate into a large number of languages (think 100), any problem in the source is amplified.

@gibson042
Collaborator

gibson042 commented Nov 1, 2023

@gibson042 Is your concern satisfied by my proposal where simple messages permit a single quoted pattern? Then there is no syntactic difference between simple and complex messages so long as the host has a way to normalize simple messages to be quoted.

If complex messages require quoting patterns but simple messages do not, then there is a syntactic difference and the problem still exists (and is even exacerbated if quoting a known simple message requires more than just wrapping it in relevant syntax).

@sffc
Member

sffc commented Nov 1, 2023

The normalization algorithm in my proposal would be just

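function normalizeMessage(message) {
    // Trim first, then wrap anything that does not already start in code mode.
    // (Reassign the parameter: redeclaring it with `let` is a syntax error.)
    message = message.trim();
    if (message[0] !== "{" && message[0] !== "#") {
        message = "{" + message + "}";
    }
    return message;
}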

That seems like a fairly easy function to run in order to get messages into the same syntax. I don't see the practical issue with that.

@aphillips
Member Author

(chair 🎩 off)

@gibson042 Thanks for this. So, to ensure I understand, what you're saying is: you'd prefer that simple patterns were trimmed (except where quoted) because complex ones are.

I think that's a separate proposal from the one we're currently balloting? That is, 1/3/4 are concerned with finding the pattern boundaries in complex messages. Depending on which we choose, we could go on to make simple patterns consistent with the quoting behavior of the chosen alternative.

For me, I think that I might want simple messages to trim if we chose 3 or 4 (because that would be consistent with pattern handling in each), but I'm not as sure about 1. In option 1, the pattern is quoted in complex messages and not trimmed inside the quotes. Simple messages in option 1 are quoted by their container, e.g.:

var myMessage = "   I am quoted by my {$container}   ";

@sffc

If the first line of your function trims the message, you remove the pattern-exterior whitespace from myMessage in the example above. I think the { or # sigil probably has to be the very first character in the message; otherwise the message is in text mode.

If we adopted what I think @gibson042 is suggesting, you still would not trim until inside the if statement, e.g.:

function normalizeMessage(message) {
    // Only simple (text mode) messages are trimmed, and only as they are wrapped.
    if (message[0] !== "{" && message[0] !== "#") {
        message = "{" + message.trim() + "}";
    }
    return message;
}

Also, you have to check if the message starts with a variable (and we allow optional whitespace there, as @eemeli has pointed out), which is why the double-sigil pattern quotes ({{...}}) have a certain appeal. That is, your function fails on this message: {$var} is saying hello or this one: { $var :foo } is saying hello
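
A rough sketch of the kind of normalizer under discussion, assuming (hypothetically) that a complex message is identified by a leading `{{` or `#` in the very first position, and that a simple pattern is wrapped in `{{...}}` without trimming. None of this syntax had been agreed on at this point in the thread, so `startsInCodeMode`, this version of `normalizeMessage`, and the choice of sigils are illustrative only.

```javascript
// Assumption: only "{{" or "#" at index 0 switches the message into code mode.
function startsInCodeMode(message) {
    return message.startsWith("{{") || message.startsWith("#");
}

// Wrap a simple message as a quoted pattern; leave complex messages untouched.
// Whitespace is preserved because the quotes now mark the pattern boundaries.
function normalizeMessage(message) {
    if (startsInCodeMode(message)) {
        return message;
    }
    return "{{" + message + "}}";
}

console.log(normalizeMessage("{$var} is saying hello"));
// "{{{$var} is saying hello}}"
console.log(normalizeMessage("   I am quoted by my {$container}   "));
// "{{   I am quoted by my {$container}   }}"
console.log(normalizeMessage("#match {$num :plural}"));
// "#match {$num :plural}"
```

With a two-character opener, a message that begins with a placeholder such as `{$var}` or `{ $var :foo }` is still classified as simple, which a check on a single leading `{` cannot do.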

@sffc
Member

sffc commented Nov 1, 2023

Also, you have to check if the message starts with a variable (and we allow optional whitespace there, as @eemeli has pointed out)

Oh that's a good point. I guess the function needs to be slightly more complex in order to check for that case.

But, my overall point remains that one could construct a fairly simple function in order to transform a simple message to a complex message, which I hope addresses Gibson's concern. If requiring a normalization function doesn't address that concern, then I can see a valid argument that the quotation rules should be the same between simple and complex messages.

@gibson042
Collaborator

@gibson042 Thanks for this. So, to ensure I understand, what you're saying is: you'd prefer that simple patterns were trimmed (except where quoted) because complex ones are.

I think that's a separate proposal from the one we're currently balloting? That is, 1/3/4 are concerned with finding the pattern boundaries in complex messages. Depending on which we choose, we could go on to make simple patterns consistent with the quoting behavior of the chosen alternative.

Yes, that seems about right—the proposal currently subject to balloting relates to other design aspects in ways that deeply affect my assessment. So I have voted narrowly in #505 according to the instructions, but provided more detail here.

Simple messages in option 1 are quoted by their container, e.g.:

var myMessage = "   I am quoted by my {$container}   ";

Yes, that is the combination of behaviors that I find objectionable because of how it complicates transitions between simple and complex messages.

But, my overall point remains that one could construct a fairly simple function in order to transform a simple message to a complex message, which I hope addresses Gibson's concern. If requiring a normalization function doesn't address that concern, then I can see a valid argument that the quotation rules should be the same between simple and complex messages.

My overall point is that such a function is not sufficiently simple, as demonstrated in this very conversation.

@sffc
Member

sffc commented Nov 2, 2023

Not really sure why the simplicity of a normaliser really matters

At some point, the code needs to figure out whether the message is simple or complex. The problem is that we start in a weird parse state where we don't know whether the message is simple or complex, so we need to read the message until we make that determination, then rewind to the beginning and start over in the correct mode.

If you use parseMessage followed by stringifyMessage, you're just hiding the problem away in those functions. The problem still exists.

@gibson042
Collaborator

To be clear, I brought up not normalization but coherence and iteration hazards [in the textual representation]. But if it is assumed that message manipulation will be performed on data structures rather than text, then there should be only one source format anyway—optimized for machines rather than people when those are in conflict, and with nothing like a "simple" variant.

@aphillips
Member Author

@sffc said:

The problem is that we start in a weird parse state where we don't know whether the message is simple or complex, so we need to read the message until we make that determination, then rewind to the beginning and start over in the correct mode.

Huh? If the first two characters are the code-mode shift, we progress from there (as a complex message), otherwise we're in a simple message. You don't have to read the whole message to figure that out.

@mihnita
Collaborator

mihnita commented Nov 2, 2023

One way to think about consistency:
We don't have to think that in when * {{ foo }} the {{ foo }} part is the pattern

Same as
simple message:

print("Hello")
print(userName)
print("!")

and

if foo {
   print("Hello")
   print(userName)
   print("!")
}

In the second case the { and } are not part of the "pattern" (block).
We are used to that as programmers.

Our bnf can be variant = when 1*(s key) [s] '{{' pattern '}}'
There are no quoted / unquoted patterns.
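
To illustrate the idea that the `{{` and `}}` delimiters sit outside the pattern, here is a minimal sketch of a parser for a single variant line. It follows the informal rule above rather than the actual MF2 ABNF; the `parseVariant` name, the regular expression, and the simplification that keys contain no whitespace or `{` are all assumptions.

```javascript
// Sketch of: variant = when 1*(s key) [s] '{{' pattern '}}'
// Returns the keys and the pattern text, or null if the line is not a variant.
function parseVariant(line) {
    const match = /^when((?:\s+[^\s{]+)+)\s*\{\{([\s\S]*)\}\}$/.exec(line);
    if (match === null) {
        return null;
    }
    return {
        keys: match[1].trim().split(/\s+/),
        // Everything between the delimiters, with whitespace left untouched.
        pattern: match[2],
    };
}

console.log(parseVariant("when * {{ foo }}"));
// { keys: [ '*' ], pattern: ' foo ' }
console.log(parseVariant("when one {{This is the {$num} pattern}}"));
// { keys: [ 'one' ], pattern: 'This is the {$num} pattern' }
```

The delimiters are consumed by the variant rule, so the returned pattern contains neither braces nor any trimming.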

@gibson042
Collaborator

@mihnita My objection is based on the practical consequences of quoting syntax being absent in a "simple message" but mandatory in a "complex message" while any given message may fall into either category, and is not affected by whether or not that syntax appears inside or outside of the productions for a pattern nonterminal in the formal grammar.

@mihnita
Collaborator

mihnita commented Nov 2, 2023

Some summary I came up with a day or two ago, trying to understand inconsistencies, and how to choose.
We have several combinations:

  • message type : simple / complex
  • wrap the pattern or not
  • trim the leading / trailing spaces or not

Complete set:

  • simple [ trim, wrap ] complex [ trim, wrap ]
  • simple [ trim, wrap ] complex [ trim, no wrap ]
  • simple [ trim, wrap ] complex [ no trim, wrap ]
  • simple [ trim, wrap ] complex [ no trim, no wrap ]
  • simple [ trim, no wrap ] complex [ trim, wrap ]
  • simple [ trim, no wrap ] complex [ trim, no wrap ]
  • simple [ trim, no wrap ] complex [ no trim, wrap ]
  • simple [ trim, no wrap ] complex [ no trim, no wrap ]
  • simple [ no trim, wrap ] complex [ trim, wrap ]
  • simple [ no trim, wrap ] complex [ trim, no wrap ]
  • simple [ no trim, wrap ] complex [ no trim, wrap ]
  • simple [ no trim, wrap ] complex [ no trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ trim, wrap ]
  • simple [ no trim, no wrap ] complex [ trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, no wrap ]

I drop all combinations that contain [ trim, wrap ], because nobody wants that.
The whole argument for wrapping is to make clear where the string really starts / ends, without the need to trim.
So:

  • simple [ trim, no wrap ] complex [ trim, no wrap ]
  • simple [ trim, no wrap ] complex [ no trim, wrap ]
  • simple [ trim, no wrap ] complex [ no trim, no wrap ]
  • simple [ no trim, wrap ] complex [ trim, no wrap ]
  • simple [ no trim, wrap ] complex [ no trim, wrap ]
  • simple [ no trim, wrap ] complex [ no trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, no wrap ]

Then we remove all cases of simple with wrap, because this is what we try to improve now, delimiting messages like "Hello world"

  • simple [ trim, no wrap ] complex [ trim, no wrap ]
  • simple [ trim, no wrap ] complex [ no trim, wrap ]
  • simple [ trim, no wrap ] complex [ no trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ trim, no wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, wrap ]
  • simple [ no trim, no wrap ] complex [ no trim, no wrap ]

We also take out simple trim no-wrap: Hello world , because trimming makes the strings really hard to translate.
If I get "foo" to translate and need to make it " fouuz" I don't know how to protect those spaces.
I don't know if the string goes through an API at all; it might be used as is. Or go through a printf-like API, or MF1, or something else. That is also true for a tool. There is just no way to tell.
So simple messages must not be trimmed!

And we are left with these

  • simple [ no trim, no wrap ] complex [ trim, no wrap ] => Option 3
  • simple [ no trim, no wrap ] complex [ no trim, wrap ] => Option 1
  • simple [ no trim, no wrap ] complex [ no trim, no wrap ] => I don't think anyone is asking for this
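The elimination above can be replayed mechanically. The following is only a sketch of that reasoning in Python; the tuple encoding and variable names are assumptions made for illustration, and the mapping of the survivors to Options 3 and 1 comes from the list above, not from the code.

```python
from itertools import product

# Each message type independently picks a trim policy and a wrap policy.
TRIM = ("trim", "no trim")
WRAP = ("wrap", "no wrap")
choices = list(product(TRIM, WRAP))        # 4 policies per message type
combos = list(product(choices, choices))   # (simple, complex): 16 combinations

# Step 1: drop every combination where either side is [ trim, wrap ].
step1 = [c for c in combos if ("trim", "wrap") not in c]

# Step 2: drop every combination where simple messages are wrapped.
step2 = [c for c in step1 if c[0][1] != "wrap"]

# Step 3: drop every combination where simple messages are trimmed.
step3 = [c for c in step2 if c[0][0] != "trim"]

for simple, complex_ in step3:
    print("simple", list(simple), "complex", list(complex_))
```

The three filters leave 9, 6, and finally 3 combinations, matching the lists above.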

So from what we see above, both options 1 and 3 have inconsistencies between simple and complex.
Option 3 does not trim simple messages, but trims complex ones.
Option 1 does not wrap simple messages, but wraps complex ones.

No matter what, we will have one inconsistency.

I would argue that the second kind of inconsistency (option 3) is worse.
The inconsistency in option 3 is an inconsistency in behavior: sometimes the string itself changes, sometimes not.
The inconsistency in option 1 is only a visual one. Simple / complex look a bit different, but behave the same: not trimmed.

@sffc
Member

sffc commented Nov 2, 2023

Huh? If the first two characters are the code-mode shift, we progress from there (as a complex message), otherwise we're in a simple message. You don't have to read the whole message to figure that out.

This is only true if either:

  • Code mode requires some specific special characters that are distinguishable from a simple message (2a used {{}} which I'd like to avoid; I proposed # earlier in this thread but it doesn't appear to work in all cases), OR
  • Placeholders use different quotes than patterns, OR
  • Single unquoted patterns are not allowed in the syntax

@mihnita
Collaborator

mihnita commented Nov 2, 2023

@gibson042

I think I understand your argument for quoting simple messages.
And I'm on board. It is in fact what we had before this whole thing was reopened.

But this can of worms is already opened.
I didn't ask for it, but I accepted that the option is off the table now, and I am trying to find a compromise.

I think that most developers will not work with MF2 syntax enough to build a "mental model" and so on.
As developers we work with all kinds of "domain specific languages" (DSL), and some seem to have quirks, for good / bad reasons, and we don't spend that much time on it. We use them as designed and go ahead.

For me, this is a DSL designed for i18n / l10n.
As such, should cater to translators a bit more than to developers.
Would be nice to have both, if possible, but correctness is more important.
If one developer grumbles because they need to type some extra curly brackets, but it means that 70 translators don't need to think how to deal with magically disappearing spaces, then I am for that.

@mihnita
Collaborator

mihnita commented Nov 2, 2023

2a used {{}} which I'd like to avoid;

In fact it used {}, but a week or two ago it was converted to {{}} to make it uglier, to make sure people are willing to work on it and not just take it "as is".
I'm not kidding.
It was something like this:

{#
match ...
when one {Deleted {count} file}
when *   {Deleted {count} files}
#}

Placeholders use different quotes than patterns

Although visually ugly, the reason to use {} is some consistency in escaping.
Whatever we use to close the pattern must be escaped in the pattern if we want it rendered.

For example: when one ~Deleted {count} file~
Now I have to escape { and ~: when one ~The \{ and \~ must be escaped~
Having to escape only { and } seems a bit easier to remember.

But I would not be against using ~ or ^ (or some other rarely used character).
Eemeli even proposed " as delimiter, and that sounds natural, and would look familiar.
The trouble is conflict with storage formats.
You put that in a string in code (gettext like), or in JSON, and now if you want to show it you must double escape.

Example (MF2):

match {$measurement_system} when imperial "Use \" for height" when * "Use cm for height"

Now put that in a format that uses \ for escaping and " for strings (JSON, or code):

"msg" : "match {$measurement_system} when imperial \"Use \\\" for height\" when * \"Use cm for height\""

So it is best if the characters to use as delimiters are rarely used in human messages, and rare in existing "storage formats"
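The double escaping described above can be reproduced with any JSON encoder. This is a small sketch using Python's standard `json` module; the two message strings are the hypothetical `"`-delimited and `~`-delimited syntaxes from this comment, not actual MF2 syntax.

```python
import json

# Hypothetical syntax with " as the pattern delimiter (one escaped " inside).
quote_delimited = r'match {$measurement_system} when imperial "Use \" for height" when * "Use cm for height"'

# The same message with ~ as the pattern delimiter: the inner " needs no MF2 escape.
tilde_delimited = r'match {$measurement_system} when imperial ~Use " for height~ when * ~Use cm for height~'

# Storing either message as a JSON string adds the storage format's own escapes.
quote_json = json.dumps(quote_delimited)
tilde_json = json.dumps(tilde_delimited)

print(quote_json)
print(tilde_json)
print("backslashes:", quote_json.count("\\"), "vs", tilde_json.count("\\"))
```

With `"` as the delimiter the stored string carries seven backslashes; with `~` it carries one.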

But I can live with something like ~ or ^

@gibson042
Collaborator

If I get "foo" to translate and need to make it " fouuz" I don't know how to protect those spaces.
I don't know if the string goes through an API at all; it might be used as is. Or go through a printf-like API, or MF1, or something else. That is also true for a tool. There is just no way to tell.
So simple messages must not be trimmed!

@mihnita You keep using examples like this, but I think you're relying upon some unstated assumptions. In what context does the hypothetical you «get "foo" to translate and need to make it " fouuz"»? If you're using a tool that outputs MF2 messages, then the tool must understand MF2 and produce appropriate output based upon its understanding of whether or not spaces in your translations should be preserved (presumably defaulting to preservation). If you're editing MF2 messages directly, then that responsibility falls on you. And if either you or a tool are trying to output content that could be interpreted as MF2 or as some other format, then I just don't see how that can possibly work because syntax such as {|literal|} is specific to MF2.

As developers we work with all kinds of "domain specific languages" (DSL), and some seem to have quirks, for good / bad reasons, and we don't spend that much time on it. We use them as designed and go ahead.

And as designers, it's our responsibility to minimize such quirks.

For me, this is a DSL designed for i18n / l10n. As such, should cater to translators a bit more than to developers. Would be nice to have both, if possible, but correctness is more important. If one developer grumbles because they need to type some extra curly brackets, but it means that 70 translators don't need to think how to deal with magically disappearing spaces, then I am for that.

So you are expecting translators to work with MF2 messages directly? That seems like a population that would be even less comfortable with having to understand that quoting patterns is mandatory in one context and forbidden in another, let alone with understanding the nuance of how to classify any particular message into the appropriate category. If that is indeed the case, then I would push for requiring all patterns to be quoted and dropping the "simple message" concept. But if instead we take a position like "message contents are for developers and pattern contents are for translators", then there are some coherent designs which can include a "simple message" concept (specifically, Option 1 with universally mandatory pattern quoting and Option 3 or 4 with universal whitespace trimming).

I suppose what I'm really objecting to is a syntactic distinction between "simple message" and "complex message", as opposed to a semantic distinction (in which e.g. a declaration may be prepended to any message, the result being a complex message even if the original message is simple, but its output being unaffected in either case). It's just that taking support for unquoted simple messages as a given drives #505 to options 3 and 4.

@sffc
Member

sffc commented Nov 2, 2023

How about #{} for pattern quotes? This might get to the outcome I'd like to see:

Simple messages:

  • Hello {$world}!
  • {$hello} world! (unambiguous: leading { is a simple message)
  • \#1 dad (leading # requires escaping)

Quoted simple messages:

  • #{Hello {$world}!}
  • #{{$hello} world!}
  • #{#1 dad} (no longer a need to escape the #)
  • #{\#1 dad} (escaping permitted, just not required)

Complex messages:

#match ...
#when one #{Deleted {count} file}
#when *   #{Deleted {count} files}
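To illustrate how little lookahead this proposal would need, here is a rough Python sketch that classifies a message from its first one or two characters. The function name and the three category labels are assumptions for illustration only; leading whitespace and full parsing are deliberately ignored.

```python
def classify(message: str) -> str:
    """Classify a message under the hypothetical '#' sigil proposal."""
    if message.startswith("#{"):
        return "quoted simple"   # '#{' opens a quoted pattern
    if message.startswith("#"):
        return "complex"         # '#' followed by a keyword such as 'match'
    return "simple"              # anything else, including a leading '{' or '\#'


examples = [
    "Hello {$world}!",
    "{$hello} world!",
    "\\#1 dad",
    "#{#1 dad}",
    "#match ...\n#when one #{Deleted {count} file}",
]
for text in examples:
    print(repr(text), "->", classify(text))
```

Only the escaped form `\#1 dad` keeps a message that starts with a literal # in the simple category, which is why the leading # needs escaping there.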

@aphillips
Member Author

@sffc mentioned:

I proposed # earlier in this thread but it doesn't appear to work in all cases),

Why not? It would require that # be escaped if it appeared as the first character of an unquoted simple pattern. But that's hardly a killer.

I originally proposed # (rather than some other bit of punctuation) in various syntax alternatives because #input and #local remind me of #define. None of the syntax conflicts with using # as a starter.

How about #{} for pattern quotes?

I'm really much happier with {{...}} for pattern quotes. I'm unhappy with double {{ for the code mode introducer, but that's not pattern quotes. I'd be unhappy with #{ for patterns because I don't see what value it brings (it's just quirky??)

If we want to make pattern quotes be single {...} then that does mean we're not LL1 or LL2 any more.

@mihnita mentioned:

simple [ no trim, no wrap ] complex [ no trim, no wrap ] => I don't think anyone is asking for this

It was one of the five original options (pre-balloting)--it is the "missing" #2, in fact. So people explicitly and unanimously asked that we not consider this one.


As @mihnita noted, we made the current syntax slightly uglier so that we could get to the problem of pattern whitespace/quoting. Once the balloting is done, we can clean the resulting syntax. That will require specific proposals, albeit ones we've already been entertaining in other threads.

Personally, I'm tempted to go change my vote, because there is a certain parsimony to:

Hello world!

Hello, {$user}!

#input {$numMessages :number integer}
{{Hello {$user}, you have {$numMessages} message(s)}}    // yes, an i18n bug

#match {$numMessages :plural}
when 0   {{Hello {$user}, you have no messages}}
when one {{Hello {$user}, you have {$numMessages} message}}
when *   {{Hello {$user}, you have {$numMessages} messages}}

I'm a bit mystified about the desire to quote simple messages. I get that machines will want to homogenize messages and sometimes will be brutal in doing so. But, to @eemeli's point, if you're using a data model internally, it's pretty simple to tell if you need to quote the whole pattern when serializing. If there are no declarations or selectors, just emit the pattern!
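A minimal sketch of that serialization rule, in Python. The `Message` class is a made-up stand-in for a data model (selectors are left out for brevity), and the `#input` / `{{...}}` output follows the hypothetical syntax in the examples above rather than any agreed grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical, heavily reduced data model: one pattern plus declarations."""
    pattern: str
    declarations: list[str] = field(default_factory=list)


def serialize(msg: Message) -> str:
    # No declarations (and, in this reduced model, no selectors): emit the bare pattern.
    if not msg.declarations:
        return msg.pattern
    # Otherwise emit code mode: declarations first, then the quoted pattern.
    return "\n".join([*msg.declarations, "{{" + msg.pattern + "}}"])


print(serialize(Message("Hello world!")))
print(serialize(Message("Hello {$user}!", ["#input {$user :person}"])))
```

Escaping of special characters inside the pattern is not handled here; a real serializer would need it.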

I also think it is totally valid to code-mode-and-quote the pattern with no declarations or selectors. But I wouldn't be excited about an implementation that did that by default. It would be sort of like always using MF1 like this:

{0,select *{Hello world!}}

@gibson042 noted:

I suppose what I'm really objecting to is a syntactic distinction between "simple message" and "complex message", as opposed to a semantic distinction (in which e.g. a declaration may be prepended to any message, the result being a complex message even if the original message is simple, but its output being unaffected in either case).

Unquoted patterns lend themselves to this better than option 1 does:

Hello, {$user}!

#input {$user :person formality=informal}Hello, {$user}!
// or
#input {$user :person formality=informal}
Hello, {$user}!

Admittedly the declaration "affects the output"--in a positive way. A translator might very much want to add a declaration to get specific options needed for their language with the person name formatter!

@gibson042
Collaborator

gibson042 commented Nov 2, 2023

I suppose what I'm really objecting to is a syntactic distinction between "simple message" and "complex message", as opposed to a semantic distinction (in which e.g. a declaration may be prepended to any message, the result being a complex message even if the original message is simple, but its output being unaffected in either case).

Unquoted patterns lend themselves to this better than option 1 does:

Hello, {$user}!

#input {$user :person formality=informal}Hello, {$user}!
// or
#input {$user :person formality=informal}
Hello, {$user}!

The last of these gets at what I'm talking about—it only satisfies the constraint if behavior includes trimming whitespace (otherwise "Hello, …!" gets transformed into "\nHello, …!", with an extra leading line feed).
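The extra line feed can be made concrete with a small Python sketch. The regular expression and the `pattern_text` helper are assumptions for illustration: they recognize only a single leading `#input {...}` declaration in the hypothetical syntax quoted above, and `str.strip()` stands in for whatever whitespace definition a real trimming rule would use.

```python
import re

# Matches one leading declaration of the hypothetical form '#input {...}'.
DECLARATION = re.compile(r"#input \{[^}]*\}")


def pattern_text(message: str, trim: bool) -> str:
    """Return the unquoted pattern that follows an optional leading declaration."""
    match = DECLARATION.match(message)
    rest = message[match.end():] if match else message
    return rest.strip() if trim else rest


simple = "Hello, {$user}!"
same_line = "#input {$user :person formality=informal}Hello, {$user}!"
next_line = "#input {$user :person formality=informal}\nHello, {$user}!"

for msg in (simple, same_line, next_line):
    print(repr(pattern_text(msg, trim=False)), repr(pattern_text(msg, trim=True)))
```

Without trimming, the third form yields a pattern that starts with a line feed; with trimming, all three forms yield the same pattern.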

For comparison, here is the equivalent in Option 1 with universally mandatory pattern quoting:

{{Hello, {$user}!}}

#input {$user :person formality=informal}{{Hello, {$user}!}}
// or
#input {$user :person formality=informal}
{{Hello, {$user}!}}

And note that the constraint is bidirectional, so Option 3 only works if pattern quoting is optional in complex messages and simple messages (i.e., that both of the above code blocks represent a valid simple message and two variations of equivalent complex analogs in Option 3, and therefore that every parser can differentiate unquoted simple message vs. quoted simple message vs. complex message).

@mihnita
Collaborator

mihnita commented Nov 2, 2023

If we want to make pattern quotes be single {...} then that does mean we're not LL1 or LL2 any more.

I think we are, because parsing moves to another state, and that state is clearly defined.
The parser does not look at input only; it is state + input => new state (maybe) + some other actions.
A typical state machine.
When you see { and you are already in pattern mode, it is a placeholder. When you are in code mode, after a when state, and you see {, it is a pattern. No ambiguity.
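As a rough illustration of the "state + input" reading described above, the Python sketch below labels every { in a single-brace complex message. The function, the three labels, and the sample message are assumptions for illustration; escapes and error recovery are ignored, and the sketch does not by itself settle the LL(1) question.

```python
def classify_braces(source: str) -> list[str]:
    """Label each '{' as 'expression', 'pattern', or 'placeholder'."""
    kinds = []
    stack = []           # kinds of the currently open braces, innermost last
    word = ""            # bare word being read in code mode
    after_when = False   # True once a 'when' keyword has been seen in code mode
    for ch in source:
        if ch == "{":
            if not stack:
                # Code mode: the meaning depends on whether 'when' came before.
                kind = "pattern" if after_when else "expression"
                after_when = False
            elif stack[-1] == "pattern":
                kind = "placeholder"
            else:
                raise ValueError("nested '{' inside an expression")
            stack.append(kind)
            kinds.append(kind)
            word = ""
        elif ch == "}":
            if not stack:
                raise ValueError("unbalanced '}'")
            stack.pop()
        elif not stack:
            # Code mode: collect bare words so the 'when' keyword is noticed.
            if ch.isspace():
                if word == "when":
                    after_when = True
                word = ""
            else:
                word += ch
    return kinds


sample = "match {$count}\nwhen one {Deleted {$count} file}\nwhen * {Deleted {$count} files}"
print(classify_braces(sample))
```

The same character gets three different labels, and each decision uses only the current state and the current input character.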

@mihnita
Collaborator

mihnita commented Nov 2, 2023

But, to @eemeli's point, if you're using a data model internally, it's pretty simple to tell if you need to quote the whole pattern when serializing

I agree with that.
The benefit of always wrapping (in complex messages) is not for parsing / serialization; it is the human factor.

@mihnita
Collaborator

mihnita commented Nov 2, 2023

@gibson042

You keep using examples like this, but I think you're relying upon some unstated assumptions. In what context does the hypothetical you «get "foo" to translate and need to make it " fouuz"»?

They are not assumptions, and have been stated before.

Chinese uses spaces in front of names ("honorific spaces"). So "John Show knows nothing" needs to become " John Show ..."
Amharic seems to use leading spaces to "fake" an indent (I shared numbers and links)
If the source is a language that does not use spaces (Chinese, Japanese, Thai, Khmer, others) and you translate into one that does, then sometimes messages will require a leading or trailing space to separate from the next message (sentence).
Because people write code in these languages too; not everything starts in English.

If you're using a tool that outputs MF2 messages, then the tool must understand MF2 and produce appropriate output based upon its understanding of whether or not spaces in your translations should be preserved.

But often (and especially for simple messages) tools have no way to know.
If I give you a .properties file with

msg = Hello world

no human nor tool can tell if that string is used as is, goes through some kind of printf API, or the JDK MessageFormat, the ICU one (MF1) or MF2

And the way things are adopted, it is in fact common to see various kinds of string formats in the same file (going through different APIs)
Unless it's a small project nobody will modify all the code they have to migrate to a new API for no good reason.

@gibson042
Collaborator

gibson042 commented Nov 3, 2023

@gibson042

You keep using examples like this, but I think you're relying upon some unstated assumptions. In what context does the hypothetical you «get "foo" to translate and need to make it " fouuz"»?

They are not assumptions, and have been stated before.

Chinese uses spaces in front of names ("honorific spaces"). So "John Show knows nothing" needs to become " John Show ..." Amharic seems to use leading spaces to "fake" an indent (I shared numbers and links) If the source is a language that does not use spaces (Chinese, Japanese, Thai, Khmer, others) and you translate into one that does, then sometimes messages will require a leading or trailing space to separate from the next message (sentence). Because people write code in these languages too; not everything starts in English.

The assumptions I'm referring to are not regarding the language in use, but rather the nature of the actor that gets "foo" and the nature of the system into which they put " fouuz".

If you're using a tool that outputs MF2 messages, then the tool must understand MF2 and produce appropriate output based upon its understanding of whether or not spaces in your translations should be preserved.

But often (and especially for simple messages) tools have no way to know. If I give you a .properties file with

msg = Hello world

no human nor tool can tell if that string is used as is, goes through some kind of printf API, or the JDK MessageFormat, the ICU one (MF1) or MF2

Do all of those systems apply the same treatment to relevant whitespace? Do they all use the same syntax to ensure that whitespace is treated as significant? The answer is almost certainly "no", and is certainly "no" for the related questions about increasingly sophisticated input. And it's not possible to impose consistency upon an already inconsistent heterogeneous environment. I just don't understand what kind of MF2 design you expect to accommodate meaningful edits of input that may or may not be sent to it, or how you expect anyone to meaningfully edit input that might be sent to MF2 directly, to MF2 after transformation, or to something entirely separate from MF2.

EDIT: @mihnita perhaps you can define the subset of e.g. .properties file syntax that you expect to be accepted directly by MF2 with particular semantics and also by some other system(s)? Presumably it must exclude values that start with or include syntax that is special in MF2, but I'm also wondering in particular about whitespace. AFAIK, foo=bar and foo = bar are treated identically by consumers of .properties files, although for contrast the value in foo = \ bar includes a leading space (i.e., consumers interpret but do not propagate backslash escape sequences). But note also that nothing necessitates direct input anyway... it's perfectly reasonable to set up a system in which .properties file data is sent to MF2 with intermediate processing, such that e.g. the four-character value from foo = \ bar is sent to MF2 as a message {{ bar}}, allowing people without any MF2 awareness to manually edit .properties file values (provided they don't modify MF2 placeholders)—and in practice such intermediate processing will be necessary for iterative transition to MF2 anyway.
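A sketch of such an intermediary, in Python. Both helpers are hypothetical: the decoder handles only a single `key = value` line with the simple backslash escapes (no `\uXXXX`, no line continuations), and the wrapper uses the `{{...}}` pattern quoting discussed in this thread without touching placeholders inside the value.

```python
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "f": "\f"}


def decode_properties_line(line: str) -> tuple[str, str]:
    """Decode one 'key = value' line roughly the way .properties consumers do."""
    key, _, raw = line.partition("=")
    raw = raw.lstrip(" \t\f")       # unescaped leading whitespace is dropped
    out = []
    chars = iter(raw)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")   # the backslash is interpreted, not propagated
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
        else:
            out.append(ch)
    return key.strip(), "".join(out)


def raw_pattern_to_message(value: str) -> str:
    """Upgrade a raw pattern to a quoted message so its whitespace is explicit."""
    return "{{" + value + "}}"


key, value = decode_properties_line("foo = \\ bar")
print(key, repr(value), raw_pattern_to_message(value))
```

The escaped leading space survives decoding and ends up inside the pattern quotes, so nobody editing the .properties file needs to know about MF2 quoting.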

@eemeli
Collaborator

eemeli commented Nov 3, 2023

But often (and especially for simple messages) tools have no way to know. If I give you a .properties file with

msg = Hello world

no human nor tool can tell if that string is used as is, goes through some kind of printf API, or the JDK MessageFormat, the ICU one (MF1) or MF2

This failure of the .properties format to support identifying the format used in its contained messages sounds like something we should account for in the message resource spec rather than the message spec.

@stasm
Collaborator

stasm commented Nov 3, 2023

@aphillips

I'm really much happier with {{...}} for pattern quotes. I'm unhappy with double {{ for the code mode introducer, but that's not pattern quotes. [...] If we want to make pattern quotes be single {...} then that does mean we're not LL1 or LL2 any more.

I've come to prefer double curlies for patterns too, even if originally we picked them to make the syntax "uglier." I find it easier to identify patterns when they're wrapped in {{...}}, and they helped address the issue of {...} serving double duty, which I've always struggled with (but accepted it) in our original syntax. Now, things are easier for me to build a mental model around: {...} are for expressions, {{...}} are for text.

I'm a bit mystified about the desire to quote simple messages.

I realize we don't have a good write-up about text-first mode, do we? I filed #512 to track it.

@stasm
Collaborator

stasm commented Nov 3, 2023

@gibson042

To be clear, I brought up not normalization but coherence and iteration hazards [in the textual representation].

Thanks for bringing up iteration hazards. I've also been worried about them, and I think we should attempt to build a catalogue of "user journeys" that capture them.

[...] If that is indeed the case, then I would push for requiring all patterns to be quoted and dropping the "simple message" concept. But if instead we take a position like "message contents are for developers and pattern contents are for translators", then there are some coherent designs which can include a "simple message" concept (specifically, Option 1 with universally mandatory pattern quoting and Option 3 or 4 with universal whitespace trimming).

How is requiring all patterns to be quoted and dropping the "simple message" concept different from Option 1 with universally mandatory pattern quoting?

IIUC, your preference for option 3 or 4 is contingent on trimming whitespace universally. If this wasn't the case, i.e. we opted for preserving whitespace in simple patterns and trimming in complex, would that change your preference? I know you were concerned about the need to additionally add/remove pattern delimiters to a simple pattern when changing declarations, but I think it can be argued that in option 1, the iteration hazard is less significant because it needs to be explicitly addressed (at the cost of an extra effort).

@gibson042
Collaborator

[...] If that is indeed the case, then I would push for requiring all patterns to be quoted and dropping the "simple message" concept. But if instead we take a position like "message contents are for developers and pattern contents are for translators", then there are some coherent designs which can include a "simple message" concept (specifically, Option 1 with universally mandatory pattern quoting and Option 3 or 4 with universal whitespace trimming).

How is requiring all patterns to be quoted and dropping the "simple message" concept different from Option 1 with universally mandatory pattern quoting?

The former is technically a subset of the latter... "Option 1 with universally mandatory pattern quoting" could involve dropping the simple message concept (such that e.g. {{Log out?}} and {%input $ignored}{{Log out?}} are both valid and equivalent), or could maintain it by requiring the intermediate "code mode" layer (e.g., {{ input {$ignored}{{Log out?}} }} as in the Option 1 example). But I think you are right to ask the question, and upon consideration I believe that introducing mandatory pattern quoting would drive out that portion of the design space.

IIUC, your preference for option 3 or 4 is contingent on trimming whitespace universally. If this wasn't the case, i.e. we opted for preserving whitespace in simple patterns and trimming in complex, would that change your preference? I know you were concerned about the need to additionally add/remove pattern delimiters to a simple pattern when changing declarations, but I think it can be argued that in option 1, the iteration hazard is less significant because it needs to be explicitly addressed (at the cost of an extra effort).

I agree that creating such a large amount of friction can work, but as observed above, it seems worse than just eliminating the concept of simple messages. A syntactic split with in-band differentiation is complicated for both machines and people, and I'm just not seeing how preservation of leading/trailing whitespace by MessageFormat is valuable enough to justify it—especially since invisible content is bad in general, and in my opinion errors from such a decision (e.g., inadvertent inclusion of trailing spaces and/or line feeds) would exceed intentional use by an order of magnitude.

I think instead that always trimming leading/trailing whitespace while maintaining a mechanism to explicitly include it, as in Options 3 and 4 (and apparently also in .properties files), fosters a coherent and comprehensible system in which embedded declarations make sense and are non-disruptive. But Option 1, although still not my preference, can also achieve that if the mandatory quoting is truly universal, which seems to preclude simple messages as presented up to this point (although contrary to some claims does not preclude oblivious editing of e.g. .properties files because there's nothing requiring such strings to always be interpreted as messages rather than raw patterns; and where there is a need for both then the responsibility of differentiation can be a property of the affected system [1] rather than burdening everyone who uses MessageFormat).

Footnotes

  1. Note that there are a variety of possibilities here... such systems can impose their own prefix for distinguishing messages from patterns, but can also leverage out-of-band metadata signals such as adjacent keys, key name suffixes, and key attributes.

@aphillips
Member Author

I agree that creating such a large amount of friction can work, but as observed above, it seems worse than just eliminating the concept of simple messages. A syntactic split with in-band differentiation is complicated for both machines and people, and I'm just not seeing how preservation of leading/trailing whitespace by MessageFormat is valuable enough to justify it—especially since invisible content is bad in general, and in my opinion errors from such a decision (e.g., inadvertent inclusion of trailing spaces and/or line feeds) would exceed intentional use by an order of magnitude.

I think one of the things that can be overlooked here is that MessageFormat is an embedded syntax. That is, the message will always be stored in some other format (properties, po, programming language source code, etc.). That external container will include the means of identifying the external boundaries of the message itself.

A lot of digital ink has been spilled around the .properties format, which is probably not the best example. That format does not provide delimiters and it does trim whitespace. Multiline requires special syntactic goo. People working in properties files are going to have to deal with that--just as they already do. On some level this has nothing to do with MessageFormat, though, since we only care about what is in the message once the storage syntax has been resolved.

There is interplay here, which is why we avoid single- and double-quotes in our syntax and also why we do not define Unicode escapes such as \u20AC. We mostly let the storage syntax do what it is good at.

Most storage syntaxes provide visible delimiters of some sort. In those cases, the external spaces on a simple message are visible. This does not mean that we are required to preserve those spaces. We could choose to trim the message (including simple messages). But we are not required to do so. I think this is a very useful debate.

One of my audience participation slides for Tuesday at the Unicode Technical Workshop actually asks this question:
[image: workshop slide]

I tend to agree that non-trimming will result in at least some cases with unintentional whitespace inclusion... just as a trimming regime will result in at least some cases with unintentional whitespace removal.

(e.g., {{ input {$ignored}{{Log out?}} }} as in the Option 1 example).

The option 1 example doesn't say that. A bare log-out message in option 1 would look like:

{{ {{Log out?}} }}

If we beautify the syntax as suggested elsewhere, it could be:

#{{Log out?}}

...but most processors would probably notice the lack of declarations and selectors and remove the decoration?

Also, note that all three options permit this (whitespace bearing) Option-4-style message:

{||}   Log out?   {||}

@gibson042
Collaborator

I think one of the things that can be overlooked here is that MessageFormat is an embedded syntax. That is, the message will always be stored in some other format (properties, po, programming language source code, etc.). That external container will include the means of identifying the external boundaries of the message itself.

I've tried to remain very conscious of that, and am thinking about scenarios where the container is a file (which almost always ends with a line feed, in part because tools like git complain otherwise) or a multi-line string in some programming language (which frequently starts and ends with a line feed because that makes for more readable source, as in my original example):

 logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
 ```;

(before: '\n{%input username}\n\nLog out {$username}?\n', after: '\nLog out?\n')

A lot of digital ink has been spilled around the .properties format, which is probably not the best example. That format does not provide delimiters and it does trim whitespace. Multiline requires special syntactic goo. People working in properties files are going to have to deal with that--just as they already do. On some level this has nothing to do with MessageFormat, though, since we only care about what is in the message once the storage syntax has been resolved.

Agreed, and I would again emphasize that nothing requires a direct connection from .properties to MessageFormat when an intermediary for upgrading raw patterns to messages would be appropriate.

One of my audience participation slides for Tuesday at the Unicode Technical Workshop actually asks this question

Excellent! But the particular example does seem a bit biased because those spaces are unexpected... might I recommend also including one like mine, where the trimmable whitespace includes line feeds?

(e.g., {{ input {$ignored}{{Log out?}} }} as in the Option 1 example).

The option 1 example doesn't say that. A bare log-out message in option 1 would look like:

{{ {{Log out?}} }}

I was trying to match the Option 1 example syntax for a message that has one declaration and no selector... does {{ input {$ignored}{{Log out?}} }} fail to meet that goal?

@stasm
Collaborator

stasm commented Nov 4, 2023

After a lot of hesitation, I voted 1 > 3 > 4. Given the hesitation and the fact that I could relate to the arguments behind both 1 and 3, I decided to vote for the option that is more explicit. In Option 1, we don't even need to ask the question whether variant patterns are autotrimmed or not.

The fact that Option 1 effectively introduces two syntaxes is not ideal, but it's a tradeoff between simplicity and convenience that isn't unheard of in programming languages. For instance, in JavaScript, arrow functions can have a simple expression body with an implied return, or a block body wrapped in curly braces, in which the return statement is explicit. I see our simple messages and complex messages in a similar way.

It's also a way to avoid iteration hazards: adding a declaration requires specifying the pattern boundary explicitly. In Option 3, OTOH, adding a declaration changes the trimming behavior -- unless we also autotrim simple patterns, which I don't think we should.

@aphillips
Member Author

2023-11-06 Group consensus is now option 1.
