Discussion Thread for Delimiting Pattern Boundaries #507
@sffc wrote: Another strong "pro" of option 1 is that it should allow us to elide the quotations around code mode if we introduce a sigil. For example: Simple text mode message:
Code mode message, identified due to the leading sigil:
More complex code mode message, also identified due to the leading sigil:
With optional quoting (option 3), |
I don't think I understand what you mean here; how is privileging |
@gibson042 (noting that the comment is actually Shane's and I just moved it here) I think @sffc is thinking of the 2a syntax variant that has an opening sigil instead of enclosing sigils. If we changed from enclosing Compare the
We imposed the deliberately less attractive (For those unfamiliar with recent MF2 history, "2a" does not refer to balloting options, but to syntax options in a previous call for consensus) |
I think the below commentary falls under this umbrella, but if not then feel free to instead spawn a new issue. Accepting simple messages as defined in Delimiting of Patterns in Complex Messages requires starting in "text mode" rather than "code mode" (cf. Syntax Variation Beauty Contest), and therefore looking for declarations and/or selectors to identify a complex message. Declarations in particular will often be followed by line feed, as they are in beauty contest examples. And it's not captured in the latest requirements, but I feel strongly that iterative updates to a message which change its classification (e.g., adding or removing declarations) should not affect treatment of leading or trailing whitespace—cf. #474 (comment)
The relevance to this discussion is that satisfying that requirement would rule out designs in which leading/trailing whitespace is preserved in a simple message but trimmed in [selector-free] complex message patterns or vice versa. I have separate concerns about preserving leading/trailing whitespace at all (which is hostile to using a line-feed-terminated plain text file as the container format)1, but AFAICT that's out of scope for #505. Footnotes
|
P.S. Option 1 requires quoting even for selector-free complex message patterns, right? And if so, would quoting be permitted in a simple message or would the above diff require touching both declarations and pattern quoting? logOutMessage = ```
-{%input username}
-
-{{Log out {$username}?}}
+Log out?
```; |
Here are some statistics from leading spaces in the Android resources:
This was done only on the It is interesting that Amharic has about 3.77 times more strings with leading spaces than English. I don't know if the reason is this or not: But the validation used for our localization checks and reports inconsistencies between leading / trailing spaces between source (English) and target languages. The Amharic translators added a leading space. It is a real phenomenon, and does not look like an accident. So we might accidentally disadvantage some languages because we think we know better. As Addison said yesterday, it is a big world out there. |
We should also be aware of the "chicken and egg" problem. Some years ago the UTC was arguing that Romanian uses both S and T with cedilla below, and S & T with comma below, pointing to websites and many new books. But at the time the most common OS was Windows 95, which was not Unicode. The S & T with cedilla below were present in the Eastern-European code page (cp-1250); S & T with comma below were Unicode only (they didn't exist in a Windows-supported code page). Looking at a limited, existing corpus tells us nothing about what is needed; it only tells us about the current status. To this day doing HTML in Chinese / Japanese, Khmer, Thai (and other languages that don't use spaces) is harder than it is in other languages. In English (and most other languages) you can wrap the text, and HTML rendering will replace the line break with a space. But if you do that in the languages above you end up with spaces in the middle of the text, where they should not be. Another example of technology disadvantaging certain languages. Do we want that for MF2, by design? The best policy for localization is "don't mess with my text!" |
I would be okay with this solution. It seems like a compromise where we always delimit the patterns, and we also avoid having more levels of nested delimiters than people find manageable or tasteful. The key point we're voting on in #505 is about whether we always delimit/quote patterns or not, and I would be okay with this compromise if it would require it.
If it is a choice between disadvantaging a language/group of people vs. disadvantaging a file format, I will pick being fair to all people & languages equally, even if it inconveniences a file format a little bit. As an i18n API, I think it behooves us to prioritize doing things fairly and consistently for all languages & people, where possible. I'm not sure of the context of the Amharic usage patterns, but regardless, I don't want to make assumptions that could affect how people use their language, for the cases that we happen to know now, and the ones that we're not yet aware of. This is very much part of the reasoning behind Option 1 in #505, so it seems in scope. |
@gibson042 asked:
Option 1 requires that the pattern(s) be quoted in all complex messages, including those with only declarations.
Patterns are currently allowed by the ABNF to consist solely of a quoted pattern (in code mode).
Adopting a code-mode sigil makes this more attractive:
Simple patterns are not trimmed. In Option 1, quoted patterns are not trimmed internally either (but the quotes have to be added). Options 3 and 4 both also require pattern quoting (of some sort) to avoid changing the trim state of the pattern. Option 4 might require whitespace quoting in simple patterns (as a somewhat obvious extension) |
So messages
Continuing discussion on the basis of that final assertion...
In addition to the above issue, that creates iteration hazards: adding a declaration to an unquoted simple message must be preceded by quoting it, but adding a declaration to a quoted simple message must not make such content changes (and therefore the agent making the change must distinguish between quoted vs. unquoted simple messages). Option 1 requiring quotes around patterns in complex messages but not in simple messages changes my ranking of it from a mere dispreference to an objection. From my perspective, unquoted simple messages are only coherent when interpreted as unquoted patterns subject to trimming, because of the way that declarations manifest as in-band insertions. "Simple message" vs. "complex message" is workable (if unnecessary) taxonomy complexity, but "unquoted simple message" vs. "quoted simple message" vs. "complex message" is a bridge too far. |
No. Messages like
Simple messages are unquoted patterns by definition. That is (chair hat on)
If you wish to change your vote, please do so by editing your existing vote comment. It is permitted to omit an item from your ranking. |
I am not sure that Bringing it up here can only muddy the waters and change people's votes. |
I'd say that noting that the patterns |
Again, I don't think So trying to put that against option 3 (which is "oh so clean") is just trying to misrepresent things to scare people into voting against 1. |
Are you proposing that if option 1 is selected, we should add a new data model error to the spec? Something along the lines of:
Or were you thinking of making it a syntax error? |
My thinking was that we keep the current Option 0 code mode syntax for everything, except if the first non-whitespace character is not In other words, we introduce a very narrow condition when the |
@sffc That was more or less my thinking, although I would probably require the code-mode sigil to get to a quoted pattern. If we allow quoted-pattern to start, that's a lot of parsing states (for the machine and for users). The doubled-brackets have grown on me as pattern quotes since starting to fool with them. I think changing the code mode sigil from paired bracket sets to something like @gibson042 has a point that some folks will force the complex wrapper on simple messages and for that option 1 is less pretty, even with reduced syntactical sugaring:
Allowing quoted-pattern into simple messages makes that:
Option 4 has some appeal in that there is only one recipe for pattern and that the whitespace we're all discussing requires some thought (intentionality). PEWS (and only PEWS) is never an accident in 4. |
I very much agree with the general feeling! But the case of Amharic, at least with the numbers above (0.34%, vs. 0.21% average and 0.09% for English) doesn't really justify anything. If Amharic translators have to do 0.13% more work for spaces than the average, that's completely negligible given that in many languages, text length or word count expands by 10% or more on translation. And that's counting adding the spaces and surrounding the syntax at the same weight as the translation itself, which is surely an overestimation. |
Done, but the new ranking is actually contingent on support for untrimmed unquoted simple messages. If Option 1 truly were "always quote patterns", including in simple messages, then I would consider it acceptable. |
(chair hat off) I'm curious about your reasoning. "Always quote patterns" would be our old syntax, which always starts in code mode. Can you explain why you feel it's better to always quote the pattern? Unquoted simple patterns seem really straightforward to use and meet the majority of message cases. |
Full reasoning is explained in #507 (comment) , but the gist of it is that I don't feel it's better to always quote the pattern—rather that the combination of untrimmed unquoted simple messages along with required quoting in complex messages is incoherent and hazardous to message iteration (in part because a message must first be classified as simple vs. complex by examining its contents). My top preference is supporting and trimming unquoted patterns in both simple and complex messages, and I can live with a model in which patterns must always be quoted in both (preserving all contents inside the quotes), but breaking that internal consistency creates too many problems for me to feel comfortable with. |
@gibson042 Is your concern satisfied by my proposal where simple messages permit a single quoted pattern? Then there is no syntactic difference between simple and complex messages so long as the host has a way to normalize simple messages to be quoted. |
I agree that when we look at one language / script the numbers might be small. But over the last few weeks we've heard arguments trying to minimize the problem with spaces:
Requirements to escape certain characters, automated trimming, changing case, or any other kind of automatic operation on text provided by translators have been a source of many bugs. |
If complex messages require quoting patterns but simple messages do not, then there is a syntactic difference and the problem still exists (and is even exacerbated if quoting a known simple message requires more than just wrapping it in relevant syntax). |
The normalization algorithm in my proposal would be just
function normalizeMessage(message) {
  // Reassign the parameter; redeclaring it with `let` would be a syntax error.
  message = message.trim();
  // Wrap only messages that do not already start in code mode.
  if (message[0] !== "{" && message[0] !== "#") {
    message = "{" + message + "}";
  }
  return message;
}
That seems like a fairly easy function to run in order to get messages into the same syntax. I don't see the practical issue with that. |
(chair 🎩 off) @gibson042 Thanks for this. So, to ensure I understand, what you're saying is: you'd prefer that simple patterns were trimmed (except where quoted) because complex ones are. I think that's a separate proposal from the one we're currently balloting? That is, 1/3/4 are concerned with finding the pattern boundaries in complex messages. Depending on which we choose, we could go on to make simple patterns consistent with the quoting behavior of the chosen alternative. For me, I think that I might want simple messages to trim if we chose 3 or 4 (because that would be consistent with pattern handling in each), but I'm not as sure about 1. In option 1, the pattern is quoted in complex messages and not trimmed inside the quotes. Simple messages in option 1 are quoted by their container, e.g.:
If your first line is If we adopted what I think @gibson042 is suggesting, you still would not trim until inside the if statement, e.g.:
Also, you have to check if the message starts with a variable (and we allow optional whitespace there, as @eemeli has pointed out), which is why the double-sigil pattern quotes ( |
Oh that's a good point. I guess the function needs to be slightly more complex in order to check for that case. But, my overall point remains that one could construct a fairly simple function in order to transform a simple message to a complex message, which I hope addresses Gibson's concern. If requiring a normalization function doesn't address that concern, then I can see a valid argument that the quotation rules should be the same between simple and complex messages. |
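As a rough illustration of the "slightly more complex" function discussed above, here is a minimal sketch. It assumes double-curly pattern quotes (`{{`, `}}`) and a `#` code-mode sigil; both are hypothetical choices made only for this example, since the thread has not settled on delimiters. It also assumes the message is wrapped without trimming, so a simple message keeps its leading and trailing whitespace.

```javascript
// Hypothetical delimiters, chosen only for illustration.
const PATTERN_OPEN = "{{";
const PATTERN_CLOSE = "}}";
const CODE_MODE_SIGIL = "#";

function normalizeMessage(message) {
  // Look past optional leading whitespace without modifying the message itself.
  const start = message.trimStart();
  const alreadyComplex =
    start.startsWith(PATTERN_OPEN) || start.startsWith(CODE_MODE_SIGIL);
  if (alreadyComplex) {
    return message;
  }
  // A simple message, including one that starts with a placeholder such as
  // "{$name}", is wrapped as-is so its surrounding whitespace is preserved.
  return PATTERN_OPEN + message + PATTERN_CLOSE;
}
```

With single-curly quoting, a simple message that begins with a placeholder would be indistinguishable from a quoted pattern by its first character alone, which is the case raised above. This sketch still misclassifies a simple message whose text begins with `#`; some escaping rule would be needed for that, and none is assumed here.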
Yes, that seems about right—the proposal currently subject to balloting relates to other design aspects in ways that deeply affect my assessment. So I have voted narrowly in #505 according to the instructions, but provided more detail here.
Yes, that is the combination of behaviors that I find objectionable because of how it complicates transitions between simple and complex messages.
My overall point is that such a function is not sufficiently simple, as demonstrated in this very conversation. |
At some point, the code needs to figure out whether the message is simple or complex. The problem is that we start in a weird parse state where we don't know whether the message is simple or complex, so we need to read the message until we make that determination, then rewind to the beginning and start over in the correct mode. If you use |
To be clear, I brought up not normalization but coherence and iteration hazards [in the textual representation]. But if it is assumed that message manipulation will be performed on data structures rather than text, then there should be only one source format anyway—optimized for machines rather than people when those are in conflict, and with nothing like a "simple" variant. |
@sffc said:
Huh? If the first two characters are the code-mode shift, we progress from there (as a complex message), otherwise we're in a simple message. You don't have to read the whole message to figure that out. |
One way to think about consistency: Same as
and
In the second case the Our bnf can be |
@mihnita My objection is based on the practical consequences of quoting syntax being absent in a "simple message" but mandatory in a "complex message" while any given message may fall into either category, and is not affected by whether or not that syntax appears inside or outside of the productions for a |
Some summary I came up with a day or two ago, trying to understand inconsistencies, and how to choose.
Complete set:
I drop all combinations that contain [ trim, wrap ], because nobody wants that.
Then we remove all cases of simple with wrap, because this is what we try to improve now, delimiting messages like "Hello world"
We also take out simple trim no-wrap: And we are left with these
So from what we see above, both options 1 and 3 have inconsistencies between simple and complex. No matter what, we will have one inconsistency. I would argue that the second kind of inconsistency (option 3) is worse. |
This is only true if either:
|
I think I understand your argument for quoting simple messages. But this can of worms is already opened. I think that most developers will not work that much with mf2 syntax to have a "mental model" and so on. For me, this is a DSL designed for i18n / l10n. |
In fact it uses
Although visually ugly, the reason to use For example: But I would not be against using Example (MF2):
Now put that in a format that uses
So it is best if the characters to use as delimiters are rarely used in human messages, and rare in existing "storage formats" But I can live with something like |
@mihnita You keep using examples like this, but I think you're relying upon some unstated assumptions. In what context does the hypothetical you «get
And as designers, it's our responsibility to minimize such quirks.
So you are expecting translators to work with MF2 messages directly? That seems like a population that would be even less comfortable with having to understand that quoting patterns is mandatory in one context and forbidden in another, let alone with understanding the nuance of how to classify any particular message into the appropriate category. If that is indeed the case, then I would push for requiring all patterns to be quoted and dropping the "simple message" concept. But if instead we take a position like "message contents are for developers and pattern contents are for translators", then there are some coherent designs which can include a "simple message" concept (specifically, Option 1 with universally mandatory pattern quoting and Option 3 or 4 with universal whitespace trimming). I suppose what I'm really objecting to is a syntactic distinction between "simple message" and "complex message", as opposed to a semantic distinction (in which e.g. a declaration may be prepended to any message, the result being a complex message even if the original message is simple, but its output being unaffected in either case). It's just that taking support for unquoted simple messages as a given drives #505 to options 3 and 4. |
How about Simple messages:
Quoted simple messages:
Complex messages:
|
@sffc mentioned:
Why not? It would require that I originally proposed
I'm really much happier with If we want to make pattern quotes be single @mihnita mentioned:
It was one of the five original options (pre-balloting)--it is the "missing" As @mihnita noted, we made the current syntax slightly uglier so that we could get to the problem of pattern whitespace/quoting. Once the balloting is done, we can clean the resulting syntax. That will require specific proposals, albeit ones we've already been entertaining in other threads. Personally, I'm tempted to go change my vote, because there is a certain parsimony to:
I'm a bit mystified about the desire to quote simple messages. I get that machines will want to homogenize messages and sometimes will be brutal in doing so. But, to @eemeli's point, if you're using a data model internally, it's pretty simple to tell if you need to quote the whole pattern when serializing. If there are no declarations or selectors, just emit the pattern! I also think it is totally valid to code-mode-and-quote the pattern with no declarations or selectors. But I wouldn't be excited about an implementation that did that by default. It would be sort of like always using MF1 like this:
@gibson042 noted:
Unquoted patterns lend themselves to this better than option 1 does:
Admittedly the declaration "affects the output"--in a positive way. A translator might very much want to add a declaration to get specific options needed for their language with the person name formatter! |
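The serialization rule mentioned in the comment above (if there are no declarations or selectors, just emit the pattern) could be sketched roughly as follows. The message shape, the double-curly pattern quotes, and the declaration string in the usage example are all hypothetical placeholders, not the spec's data model or syntax; selectors are left out to keep the sketch short.

```javascript
// Hypothetical, simplified message shape: declarations are pre-serialized
// strings and the pattern is a single string. Selectors are not modeled.
const PATTERN_OPEN = "{{"; // assumed pattern quotes
const PATTERN_CLOSE = "}}";

function serializeMessage({ declarations = [], pattern }) {
  if (declarations.length === 0) {
    // Nothing forces code mode, so emit the pattern as a simple message.
    return pattern;
  }
  // Otherwise each declaration goes on its own line, followed by the quoted pattern.
  return [...declarations, PATTERN_OPEN + pattern + PATTERN_CLOSE].join("\n");
}

// Usage: the declaration text is a placeholder, not real syntax.
console.log(serializeMessage({ pattern: "Log out?" }));
console.log(
  serializeMessage({
    declarations: ["<declaration for $username>"],
    pattern: "Log out {$username}?",
  })
);
```

A serializer like this decides on quoting from the structure of the message, not by inspecting its text, which is why classification is cheap on the data-model side even if it is harder for someone editing the source string.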
The last of these gets at what I'm talking about—it only satisfies the constraint if behavior includes trimming whitespace (otherwise For comparison, here is the equivalent in Option 1 with universally mandatory pattern quoting:
And note that the constraint is bidirectional, so Option 3 only works if pattern quoting is optional in complex messages and simple messages (i.e., that both of the above code blocks represent a valid simple message and two variations of equivalent complex analogs in Option 3, and therefore that every parser can differentiate unquoted simple message vs. quoted simple message vs. complex message). |
I think we are, because parsing moves to another state, and that state is clearly defined. |
I agree with that. |
They are not assumptions, and have been stated before. Chinese uses spaces in front of names ("honorific spaces"). So
But often (and especially for simple messages) tools have no way to know.
no human nor tool can tell if that string is used as is, goes through some kind of printf API, or the jdk MessagesFormat, the ICU one (mf1) or MF2 And the way things are adopted, it is in fact common to see various kinds of string formats in the same file (going through different APIs) |
The assumptions I'm referring to are not regarding the language in use, but rather the nature of the actor that gets
Do all of those systems apply the same treatment to relevant whitespace? Do they all use the same syntax to ensure that whitespace is treated as significant? The answer is almost certainly "no", and is certainly "no" for the related questions about increasingly sophisticated input. And it's not possible to impose consistency upon an already inconsistent heterogeneous environment. I just don't understand what kind of MF2 design you expect to accommodate meaningful edits of input that may or may not be sent to it, or how you expect anyone to meaningfully edit input that might be sent to MF2 directly, to MF2 after transformation, or to something entirely separate from MF2. EDIT: @mihnita perhaps you can define the subset of e.g. |
This failure of the .properties format to support identifying the format used in its contained messages sounds like something we should account for in the message resource spec rather than the message spec. |
I've come to prefer double curlies for patterns too, even if originally we picked them to make the syntax "uglier." I find it easier to identify patterns when they're wrapped in
I realize we don't have a good write-up about text-first mode, do we? I filed #512 to track it. |
Thanks for bringing up iteration hazards. I've also been worried about them, and I think we should attempt to build a catalogue of "user journeys" that capture them.
How is IIUC, your preference for option 3 or 4 is contingent on trimming whitespace universally. If this wasn't the case, i.e. we opted for preserving whitespace in simple patterns and trimming in complex, would that change your preference? I know you were concerned about the need to additionally add/remove pattern delimiters to a simple pattern when changing declarations, but I think it can be argued that in option 1, the iteration hazard is less significant because it needs to be explicitly addressed (at the cost of an extra effort). |
The former is technically a subset of the latter... "Option 1 with universally mandatory pattern quoting" could involve dropping the simple message concept (such that e.g.
I agree that creating such a large amount of friction can work, but as observed above, it seems worse than just eliminating the concept of simple messages. A syntactic split with in-band differentiation is complicated for both machines and people, and I'm just not seeing how preservation of leading/trailing whitespace by MessageFormat is valuable enough to justify it—especially since invisible content is bad in general, and in my opinion errors from such a decision (e.g., inadvertent inclusion of trailing spaces and/or line feeds) would exceed intentional use by an order of magnitude. I think instead that always trimming leading/trailing whitespace while maintaining a mechanism to explicitly include it, as in Options 3 and 4 (and apparently also in Footnotes
|
I think one of the things that can be overlooked here is that MessageFormat is an embedded syntax. That is, the message will always be stored in some other format (properties, po, programming language source code, etc.). That external container will include the means of identifying the external boundaries of the message itself. A lot of digital ink has been spilled around the There is interplay here, which is why we avoid single- and double-quotes in our syntax and also why we do not define Unicode escapes such as Most storage syntaxes provide visible delimiters of some sort. In those cases, the external spaces on a simple message are visible. This does not mean that we are required to preserve those spaces. We could choose to trim the message (including simple messages). But we are not required to do so. I think this is a very useful debate. One of my audience participation slides for Tuesday at the Unicode Technical Workshop actually asks this question: I tend to agree that non-trimming will result in at least some cases with unintentional whitespace inclusion... just as a trimming regime will result in at least some cases with unintentional whitespace removal.
The option 1 example doesn't say that. A bare log-out message in option 1 would look like:
If we beautify the syntax as suggested elsewhere, it could be:
...but most processors probably notice that lack of declarations and selectors and remove the decoration? Also, note that all three options permit this (whitespace bearing) Option-4-style message:
|
I've tried to remain very conscious of that, and am thinking about scenarios where the container is a file (which almost always end with a line feed, in part because tools like git complain otherwise) or a multi-line string in some programming language (which frequently start and end with line feed because that makes for more readable source, as in my original example): logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
```; (before:
Agreed, and I would again emphasize that nothing requires a direct connection from
Excellent! But the particular example does seem a bit biased because those spaces are unexpected... might I recommend also including one like mine, where the trimmable whitespace includes line feeds?
I was trying to match the Option 1 example syntax for a message that has one declaration and no selector... does |
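To make the multi-line-string scenario above concrete, here is a minimal JavaScript sketch. It assumes the message is stored in a template literal (the container and the variable name are only illustrative) and shows which line feeds an untrimmed reading of a simple message would retain.

```javascript
// Container: a multi-line string laid out for readable source code.
const logOutMessage = `
Log out?
`;

// The container itself contributes a leading and a trailing line feed.
const preserved = logOutMessage;      // what an untrimmed simple message would contain
const trimmed = logOutMessage.trim(); // what a trimming regime would contain

console.log(JSON.stringify(preserved)); // "\nLog out?\n"
console.log(JSON.stringify(trimmed));   // "Log out?"
```

Whether the line feeds in `preserved` count as intended content or as container noise is exactly the question under debate; the sketch only shows that the two readings produce different strings.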
After a lot of hesitation, I voted The fact that Option 1 effectively introduces two syntaxes is not ideal, but it's a tradeoff between simplicity and convenience that isn't unheard of in programming languages. For instance, in JavaScript, arrow functions can have a simple expression body with an implied It's also a way to avoid iteration hazards: adding a declaration requires specifying the pattern boundary explicitly. In Option 3, OTOH, adding a declaration changes the trimming behavior -- unless we also autotrim simple patterns, which I don't think we should. |
2023-11-06 Group consensus is now option 1. |
We are currently balloting this issue in #505. Use this issue for any technical discussion or questions.
Some useful links: