DRAFT MessageFormat 2.0 Syntax

Table of Contents

[TBD]

Introduction

This section defines the formal grammar describing the syntax of a single message.

Design Goals

This section is non-normative.

The design goals of the syntax specification are as follows:

  1. The syntax should leverage the familiarity with ICU MessageFormat 1.0 in order to lower the barrier to entry and increase the chance of adoption. At the same time, the syntax should fix the pain points of ICU MessageFormat 1.0.

    • Non-Goal: Be backwards-compatible with the ICU MessageFormat 1.0 syntax.
  2. The syntax inside translatable content should be easy to understand for humans. This includes making it clear which parts of the message body are translatable content, which parts inside it are placeholders for expressions, as well as making the selection logic predictable and easy to reason about.

    • Non-Goal: Make the syntax intuitive enough for non-technical translators to hand-edit. Instead, we assume that most translators will work with MessageFormat 2 by means of GUI tooling, CAT workbenches etc.
  3. The syntax surrounding translatable content should be easy to write and edit for developers, localization engineers, and easy to parse by machines.

  4. The syntax should make a single message easily embeddable inside many container formats: .properties, YAML, XML, inlined as string literals in programming languages, etc. This includes a future MessageResource specification.

    • Non-Goal: Support unnecessary escape sequences, which would themselves require additional escaping when embedded. Instead, we tolerate direct use of nearly all characters (including line breaks, control characters, etc.) and rely upon escaping in those outer formats to aid human comprehension (e.g., depending upon container format, a U+000A LINE FEED might be represented as \n, \012, \x0A, \u000A, \U0000000A, &#xA;, &NewLine;, %0A, <LF>, or something else entirely).

Design Restrictions

This section is non-normative.

The syntax specification takes into account the following design restrictions:

  1. Whitespace outside the translatable content should be insignificant. It should be possible to define a message entirely on a single line with no ambiguity, as well as to format it over multiple lines for clarity.

  2. The syntax should define as few special characters and sigils as possible. Note that this necessitates extra care when presenting messages for human consumption, because they may contain invisible characters such as U+200B ZERO WIDTH SPACE, control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.

Messages and their Syntax

The purpose of MessageFormat is to allow content to vary at runtime. This variation might be due to placing a value into the content, to selecting a different bit of content based on some data value, or to a combination of the two.

MessageFormat calls the template for a given formatting operation a message.

The values passed in at runtime (which are to be placed into the content or used to select between different content items) are called external variables. The author of a message can also assign local variables, including variables that modify external variables.

This part of the MessageFormat specification defines the syntax for a message, along with the concepts and terminology needed when processing a message during the formatting of a message at runtime.

The complete formal syntax of a message is described by the ABNF.

Well-formed vs. Valid Messages

A message is well-formed if it satisfies all the rules of the grammar. Attempting to parse a message that is not well-formed will result in a Syntax Error.

A message is valid if it is well-formed and also meets the additional content restrictions and semantic requirements about its structure defined below for declarations, matcher, and options. Attempting to parse a message that is not valid will result in a Data Model Error.

The Message

A message is the complete template for a specific message formatting request.

A variable is a name associated with a resolved value.

An external variable is a variable whose name and initial value are supplied by the caller to MessageFormat or available in the formatting context. Only an external variable can appear as an operand in an input declaration.

A local variable is a variable created as the result of a local declaration.

Note

This syntax is designed to be embeddable into many different programming languages and formats. As such, it avoids constructs, such as character escapes, that are specific to any given file format or processor. In particular, it avoids using quote characters common to many file formats and formal languages so that these do not need to be escaped in the body of a message.

Note

In general (and except where required by the syntax), whitespace carries no meaning in the structure of a message. While many of the examples in this spec are written on multiple lines, the formatting shown is primarily for readability.

Example This message:

.local $foo   =   { |horse| }
{{You have a {$foo}!}}

Can also be written as:

.local $foo={|horse|}{{You have a {$foo}!}}

An exception to this is: whitespace inside a pattern is always significant.

Note

The MessageFormat 2 syntax assumes that each message will be displayed with a left-to-right display order and be processed in the logical character order. The syntax permits the use of right-to-left characters in identifiers, literals, and other values. This can cause confusion when viewing the message, and users might incorrectly insert bidi controls or marks that negatively affect the output of the message.

To assist with this, the syntax permits the use of various controls and strongly-directional markers in both optional and required whitespace in a message, as well as encouraging the use of isolating controls with expressions and quoted patterns. See: whitespace (below) for more information.

Additional restrictions or requirements might be added during the Tech Preview to better manage bidirectional text.

A message can be a simple message or it can be a complex message.

message = simple-message / complex-message

A simple message contains a single pattern, with restrictions on its first non-whitespace character. An empty string is a valid simple message.

Whitespace at the start or end of a simple message is significant, and a part of the text of the message.

simple-message = o [simple-start pattern]
simple-start   = simple-start-char / escaped-char / placeholder

A complex message is any message that contains declarations, a matcher, or both. A complex message always begins with either a keyword that has a . prefix or a quoted pattern and consists of:

  1. an optional list of declarations, followed by
  2. a complex body

Whitespace at the start or end of a complex message is not significant, and does not affect the processing of the message.

complex-message = o *(declaration o) complex-body o
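
The distinction between the two kinds of message can be sketched in Python as follows. This is a minimal illustration, not a parser: the helper name `classify_message` is hypothetical, and the function assumes its input is already a well-formed message. It only inspects the first character that follows any leading whitespace or bidi controls.

```python
# Whitespace and bidi code points permitted by the `o` production.
WS = "\t\n\r \u3000"
BIDI = "\u061C\u200E\u200F\u2066\u2067\u2068\u2069"


def classify_message(source: str) -> str:
    """Return "complex" or "simple" for a well-formed message (hypothetical helper)."""
    rest = source.lstrip(WS + BIDI)
    # A complex message begins with a keyword (".input", ".local", ".match")
    # or with a quoted pattern ("{{"). Anything else is a simple message.
    if rest.startswith(".") or rest.startswith("{{"):
        return "complex"
    return "simple"
```

For example, `classify_message("Hello, {$name}!")` returns `"simple"`, while `classify_message("  {{Hello}}")` returns `"complex"`.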

Declarations

A declaration binds a variable identifier to a value within the scope of a message. This variable can then be used in other expressions within the same message. Declarations are optional: many messages will not contain any declarations.

An input-declaration binds a variable to an external input value. The variable-expression of an input-declaration MAY include a function that is applied to the external value.

A local-declaration binds a variable to the resolved value of an expression.

declaration       = input-declaration / local-declaration
input-declaration = input o variable-expression
local-declaration = local s variable o "=" o expression

Variables, once declared, MUST NOT be redeclared. A message that violates any of the following rules is not valid and will produce a Duplicate Declaration error during processing:

  • A declaration MUST NOT bind a variable that appears as a variable anywhere within a previous declaration.
  • An input-declaration MUST NOT bind a variable that appears anywhere within the function of its variable-expression.
  • A local-declaration MUST NOT bind a variable that appears in its expression.

A local-declaration MAY overwrite an external input value as long as the external input value does not appear in a previous declaration.
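
The redeclaration rules can be illustrated with a short Python sketch. The `Declaration` type and the function `find_duplicate_declarations` are hypothetical and are not part of any MessageFormat API. The sketch assumes a parser has already extracted, for each declaration, the bound variable and the set of variables that appear in its function (for an input-declaration) or its expression (for a local-declaration).

```python
from typing import Iterable, NamedTuple


class Declaration(NamedTuple):
    kind: str              # "input" or "local"
    name: str              # the variable being bound, without "$"
    references: frozenset  # variables used in the function or expression


def find_duplicate_declarations(declarations: Iterable[Declaration]) -> list:
    """Return the names whose declaration would be a Duplicate Declaration."""
    seen = set()  # every variable that appeared in an earlier declaration
    errors = []
    for decl in declarations:
        # Rule 1: the name appeared anywhere in a previous declaration.
        # Rules 2 and 3: the name appears in its own function or expression.
        if decl.name in seen or decl.name in decl.references:
            errors.append(decl.name)
        seen.add(decl.name)
        seen.update(decl.references)
    return errors
```

Under these assumptions, `.local $a = {$b}` followed by `.local $b = {1}` is reported as an error for `b`, because `$b` already appears in the earlier declaration.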

Note

These restrictions only apply to declarations. A placeholder can apply a different function to a variable than one applied to the same variable named in a declaration. For example, this message is valid:

.input {$var :number maximumFractionDigits=0}
.local $var2 = {$var :number maximumFractionDigits=2}
.match $var2
0 {{The selector can apply a different function to {$var} for the purposes of selection}}
* {{A placeholder in a pattern can apply a different function to {$var :number maximumFractionDigits=3}}}

(See the Errors section for examples of invalid messages)

Complex Body

The complex body of a complex message is the part that will be formatted. The complex body consists of either a quoted pattern or a matcher.

complex-body = quoted-pattern / matcher

Pattern

A pattern contains a sequence of text and placeholders to be formatted as a unit. Unless there is an error, resolving a message always results in the formatting of a single pattern.

pattern = *(text-char / escaped-char / placeholder)

A pattern MAY be empty.

A pattern MAY contain an arbitrary number of placeholders to be evaluated during the formatting process.

Quoted Pattern

A quoted pattern is a pattern that is "quoted" to prevent interference with other parts of the message. A quoted pattern starts with a sequence of two U+007B LEFT CURLY BRACKET {{ and ends with a sequence of two U+007D RIGHT CURLY BRACKET }}.

quoted-pattern = o "{{" pattern "}}"

A quoted pattern MAY be empty.

An empty quoted pattern:

{{}}

Text

text is the translatable content of a pattern. Any Unicode code point is allowed, except for U+0000 NULL and the surrogate code points U+D800 through U+DFFF inclusive. The characters U+005C REVERSE SOLIDUS \, U+007B LEFT CURLY BRACKET {, and U+007D RIGHT CURLY BRACKET } MUST be escaped as \\, \{, and \} respectively.

In the ABNF, text is represented by non-empty sequences of simple-start-char, text-char, escaped-char, and s. The production simple-start-char represents the first non-whitespace in a simple message and matches text-char except for not allowing U+002E FULL STOP .. The ABNF uses content-char as a shared base for text and quoted literal characters.

Whitespace in text, including tabs, spaces, and newlines is significant and MUST be preserved during formatting.

simple-start-char = content-char / "@" / "|"
text-char         = content-char / ws / "." / "@" / "|"
quoted-char       = content-char / ws / "." / "@" / "{" / "}"
content-char      = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
                  / %x0B-0C        ; omit CR (%x0D)
                  / %x0E-1F        ; omit SP (%x20)
                  / %x21-2D        ; omit . (%x2E)
                  / %x2F-3F        ; omit @ (%x40)
                  / %x41-5B        ; omit \ (%x5C)
                  / %x5D-7A        ; omit { | } (%x7B-7D)
                  / %x7E-2FFF      ; omit IDEOGRAPHIC SPACE (%x3000)
                  / %x3001-D7FF    ; omit surrogates
                  / %xE000-10FFFF
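
The character productions above translate directly into range checks. The following Python sketch is one possible encoding; the function names are hypothetical, and the whitespace set is taken from the ws production defined in the Whitespace section of this document.

```python
# Inclusive code point ranges copied from the content-char production.
CONTENT_CHAR_RANGES = [
    (0x01, 0x08), (0x0B, 0x0C), (0x0E, 0x1F), (0x21, 0x2D),
    (0x2F, 0x3F), (0x41, 0x5B), (0x5D, 0x7A), (0x7E, 0x2FFF),
    (0x3001, 0xD7FF), (0xE000, 0x10FFFF),
]
WS = "\t\n\r \u3000"  # ws = SP / HTAB / CR / LF / %x3000


def is_content_char(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CONTENT_CHAR_RANGES)


def is_text_char(ch: str) -> bool:
    # text-char = content-char / ws / "." / "@" / "|"
    return is_content_char(ch) or ch in WS or ch in ".@|"


def is_quoted_char(ch: str) -> bool:
    # quoted-char = content-char / ws / "." / "@" / "{" / "}"
    return is_content_char(ch) or ch in WS or ch in ".@{}"
```

Each function expects a single character. Note how `|` is a text-char but not a quoted-char, while `{` and `}` are the reverse.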

When a pattern is quoted by embedding the pattern in curly brackets, the resulting message can be embedded into various formats regardless of the container's whitespace trimming rules. Otherwise, care must be taken to ensure that pattern-significant whitespace is preserved.

Example In a Java .properties file, the values hello and hello2 both contain an identical message which consists of a single pattern. This pattern consists of text with exactly three spaces before and after the word "Hello":

hello = {{   Hello   }}
hello2=\   Hello  \ 

Placeholder

A placeholder is an expression or markup that appears inside of a pattern and which will be replaced during the formatting of a message.

placeholder = expression / markup

Matcher

A matcher is the complex body of a message that allows runtime selection of the pattern to use for formatting. This allows the form or content of a message to vary based on values determined at runtime.

A matcher consists of the keyword .match followed by at least one selector and at least one variant.

When the matcher is processed, the result will be a single pattern that serves as the template for the formatting process.

A message can only be considered valid if the following requirements are satisfied; otherwise, a corresponding Data Model Error will be produced during processing:

  • Variant Key Mismatch: The number of keys on each variant MUST be equal to the number of selectors.
  • Missing Fallback Variant: At least one variant MUST exist whose keys are all equal to the "catch-all" key *.
  • Missing Selector Annotation: Each selector MUST be a variable that directly or indirectly references a declaration with a function.
  • Duplicate Variant: Each variant MUST use a list of keys that is unique from that of all other variants in the message. Literal keys are compared by their contents, not their syntactical appearance.

matcher         = match-statement s variant *(o variant)
match-statement = match 1*(s selector)
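
Three of the four requirements above depend only on the shape of the variants, so they can be checked with a few lines of code. The Python sketch below is illustrative only: `check_matcher` and `CATCH_ALL` are hypothetical names, each variant is assumed to be given as a list of already-parsed key values, and the Missing Selector Annotation check is omitted because it needs the declarations.

```python
import unicodedata

CATCH_ALL = object()  # stands for the "*" key, distinct from the literal |*|


def normalize_key(key):
    # Literal keys are compared as if they were in NFC.
    return key if key is CATCH_ALL else unicodedata.normalize("NFC", key)


def check_matcher(selector_count: int, variants: list) -> list:
    """Return the names of the Data Model Errors found (hypothetical helper).

    `variants` is a list of key lists; the patterns are irrelevant here.
    """
    errors = []
    if any(len(keys) != selector_count for keys in variants):
        errors.append("Variant Key Mismatch")
    if not any(keys and all(k is CATCH_ALL for k in keys) for keys in variants):
        errors.append("Missing Fallback Variant")
    normalized = [tuple(normalize_key(k) for k in keys) for keys in variants]
    if len(set(normalized)) != len(normalized):
        errors.append("Duplicate Variant")
    return errors
```

Because keys are passed as parsed values, the quoted literal |1| and the unquoted literal 1 are both represented by the string `"1"` and are therefore treated as duplicates.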

A message with a matcher:

.input {$count :number}
.match $count
one {{You have {$count} notification.}}
*   {{You have {$count} notifications.}}

A message containing a matcher formatted on a single line:

.local $os = {:platform} .match $os windows {{Settings}} * {{Preferences}}

Selector

A selector is a variable whose resolved value ranks or excludes the variants based on the value of the corresponding key in each variant. The combination of selectors in a matcher thus determines which pattern will be used during formatting.

selector = variable

There MUST be at least one selector in a matcher. There MAY be any number of additional selectors.

A message with a single selector. It uses a custom function :hasCase, which is a selector that allows the message to choose a pattern based on grammatical case:

.local $hasCase = {$userName :hasCase}
.match $hasCase
vocative {{Hello, {$userName :person case=vocative}!}}
accusative {{Please welcome {$userName :person case=accusative}!}}
* {{Hello!}}

A message with two selectors:

.input {$numLikes :integer}
.input {$numShares :integer}
.match $numLikes $numShares
0   0   {{Your item has no likes and has not been shared.}}
0   one {{Your item has no likes and has been shared {$numShares} time.}}
0   *   {{Your item has no likes and has been shared {$numShares} times.}}
one 0   {{Your item has {$numLikes} like and has not been shared.}}
one one {{Your item has {$numLikes} like and has been shared {$numShares} time.}}
one *   {{Your item has {$numLikes} like and has been shared {$numShares} times.}}
*   0   {{Your item has {$numLikes} likes and has not been shared.}}
*   one {{Your item has {$numLikes} likes and has been shared {$numShares} time.}}
*   *   {{Your item has {$numLikes} likes and has been shared {$numShares} times.}}

Variant

A variant is a quoted pattern associated with a list of keys in a matcher. Each variant MUST begin with a sequence of keys, and terminate with a valid quoted pattern. The number of keys in each variant MUST match the number of selectors in the matcher.

Keys are separated from each other by whitespace. Whitespace is permitted but not required between the last key and the quoted pattern.

variant = key *(s key) quoted-pattern
key     = literal / "*"

Key

A key is a value in a variant for use by a selector when ranking or excluding variants during the matching process. A key can be either a literal value or the "catch-all" key *.

The catch-all key is a special key, represented by *, that matches all values for a given selector.

The value of each key MUST be treated as if it were in Unicode Normalization Form C ("NFC"). Two keys are considered equal if they are canonically equivalent strings, that is, if they consist of the same sequence of Unicode code points after Unicode Normalization Form C has been applied to both.

Expressions

An expression is a part of a message that will be determined during the message's formatting.

An expression MUST begin with U+007B LEFT CURLY BRACKET { and end with U+007D RIGHT CURLY BRACKET }. An expression MUST NOT be empty. An expression cannot contain another expression. An expression MAY contain one or more attributes.

A literal-expression contains a literal, optionally followed by a function.

A variable-expression contains a variable, optionally followed by a function.

A function-expression contains a function without an operand.

expression          = literal-expression
                    / variable-expression
                    / function-expression
literal-expression  = "{" o literal [s function] *(s attribute) o "}"
variable-expression = "{" o variable [s function] *(s attribute) o "}"
function-expression = "{" o function *(s attribute) o "}"

There are several types of expression that can appear in a message. All expressions share a common syntax. The types of expression are:

  1. The value of a local-declaration
  2. A kind of placeholder in a pattern

Additionally, an input-declaration can contain a variable-expression.

Examples of different types of expression

Declarations:

.input {$x :function option=value}
.local $y = {|This is an expression|}

Placeholders:

This placeholder contains a literal expression: {|literal|}
This placeholder contains a variable expression: {$variable}
This placeholder references a function on a variable: {$variable :function with=options}
This placeholder contains a function expression with a variable-valued option: {:function option=$variable}

Operand

An operand is the literal of a literal-expression or the variable of a variable-expression.

Function

A function is named functionality in an expression. Functions are used to evaluate, format, select, or otherwise process data values during formatting.

A function can appear in an expression by itself or following a single operand. When following an operand, the operand serves as input to the function.

Each function is defined by the runtime's function registry. A function's entry in the function registry will define whether the function is a selector or formatter (or both), whether an operand is required, what form the values of an operand can take, what options and option values are acceptable, and what outputs might result. See function registry for more information.

A function starts with a prefix sigil : followed by an identifier. The identifier MAY be followed by one or more options. Options are not required.

function = ":" identifier *(s option)

A message with a function operating on the variable $now:

It is now {$now :datetime}.

Options

An option is a key-value pair containing a named argument that is passed to a function.

An option has an identifier and a value. The identifier is separated from the value by a U+003D EQUALS SIGN = along with optional whitespace. The value of an option can be either a literal or a variable.

Multiple options are permitted in a function. Options are separated from the preceding function identifier and from each other by whitespace. Each option's identifier MUST be unique within the function: a function with duplicate option identifiers is not valid and will produce a Duplicate Option Name error during processing.

The order of options is not significant.

option = identifier o "=" o (literal / variable)
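
As a small illustration of the uniqueness requirement, the following Python sketch reports option identifiers that occur more than once in a function. The helper name `duplicate_option_names` is hypothetical, and the identifiers are assumed to have been extracted by a parser and to be comparable as plain strings.

```python
from collections import Counter


def duplicate_option_names(identifiers):
    """Return option identifiers that occur more than once, in first-seen order."""
    counts = Counter(identifiers)
    return [name for name in counts if counts[name] > 1]
```

A non-empty result corresponds to a Duplicate Option Name error. For `{$date :datetime weekday=long weekday=short}` the identifiers would be `["weekday", "weekday"]` and the result `["weekday"]`.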

Examples of functions with options

A message using the :datetime function. The option weekday has the literal long as its value:

Today is {$date :datetime weekday=long}!

A message using the :datetime function. The option weekday has a variable $dateStyle as its value:

Today is {$date :datetime weekday=$dateStyle}!

Markup

Markup placeholders are pattern parts that can be used to represent non-language parts of a message, such as inline elements or styling that should apply to a span of parts.

Markup MUST begin with U+007B LEFT CURLY BRACKET { and end with U+007D RIGHT CURLY BRACKET }. Markup MAY contain one or more attributes.

Markup comes in three forms:

Markup-open starts with U+0023 NUMBER SIGN # and represents an opening element within the message, such as markup used to start a span. It MAY include options.

Markup-standalone starts with U+0023 NUMBER SIGN # and has a U+002F SOLIDUS / immediately before its closing } representing a self-closing or standalone element within the message. It MAY include options.

Markup-close starts with U+002F SOLIDUS / and is a pattern part ending a span.

markup = "{" o "#" identifier *(s option) *(s attribute) o ["/"] "}"  ; open and standalone
       / "{" o "/" identifier *(s option) *(s attribute) o "}"  ; close
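
The three forms can be told apart by the sigil after the opening { and by whether a / sits immediately before the closing }. The Python sketch below shows this under the assumption that its input is a single, well-formed markup placeholder; the helper name `markup_kind` is hypothetical.

```python
WS_AND_BIDI = "\t\n\r \u3000\u061C\u200E\u200F\u2066\u2067\u2068\u2069"


def markup_kind(placeholder: str) -> str:
    """Classify a well-formed markup placeholder (hypothetical helper)."""
    if not (placeholder.startswith("{") and placeholder.endswith("}")):
        raise ValueError("markup must be enclosed in { and }")
    inner = placeholder[1:-1].strip(WS_AND_BIDI)
    if inner.startswith("/"):
        return "close"
    if inner.startswith("#"):
        # Standalone markup has "/" immediately before the closing "}".
        return "standalone" if placeholder.endswith("/}") else "open"
    raise ValueError("not markup: expected # or / after {")
```

The `endswith("/}")` test is sufficient for well-formed input because neither a name, a number-literal, nor the closing | of a quoted literal can end in /.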

A message with one button markup span and a standalone img markup element:

{#button}Submit{/button} or {#img alt=|Cancel| /}.

A message containing markup that uses options to pair two closing markup placeholders to the one open markup placeholder:

{#ansi attr=|bold,italic|}Bold and italic{/ansi attr=|bold|} italic only {/ansi attr=|italic|} no formatting.

A markup-open can appear without a corresponding markup-close. A markup-close can appear without a corresponding markup-open. Markup placeholders can appear in any order without making the message invalid. However, specifications or implementations defining markup might impose requirements on the pairing, ordering, or contents of markup during formatting.

Attributes

An attribute is an identifier with an optional value that appears in an expression or in markup. During formatting, attributes have no effect, and they can be treated as code comments.

Attributes are prefixed by a U+0040 COMMERCIAL AT @ sign, followed by an identifier. An attribute MAY have a literal value which is separated from the identifier by a U+003D EQUALS SIGN = along with optional whitespace.

Multiple attributes are permitted in an expression or markup. Each attribute is separated by whitespace.

Each attribute's identifier SHOULD be unique within the expression or markup: all but the last attribute with the same identifier are ignored. The order of attributes is not otherwise significant.

attribute = "@" identifier [o "=" o literal]
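
The rule that only the last attribute with a given identifier is kept can be sketched in Python as follows. The helper name `resolve_attributes` is hypothetical, and attributes are assumed to arrive as (identifier, value) pairs in source order, with `None` standing for an attribute that has no value.

```python
def resolve_attributes(pairs):
    """Collapse (identifier, value) pairs so that the last duplicate wins."""
    resolved = {}
    for identifier, value in pairs:
        # A later attribute with the same identifier replaces the earlier one.
        resolved[identifier] = value
    return resolved
```

For `{|bonjour| @translate=yes @translate=no}` this yields `{"translate": "no"}`. Unlike duplicate option names, duplicate attributes are not an error.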

Examples of expressions and markup with attributes:

A message including a literal that should not be translated:

In French, "{|bonjour| @translate=no}" is a greeting

A message with markup that should not be copied:

Have a {#span @can-copy}great and wonderful{/span @can-copy} birthday!

Other Syntax Elements

This section defines common elements used to construct messages.

Keywords

A keyword is a reserved token that has a unique meaning in the message syntax.

The following three keywords are defined: .input, .local, and .match. Keywords are always lowercase and start with U+002E FULL STOP ..

input = %s".input"
local = %s".local"
match = %s".match"

Literals

A literal is a character sequence that appears outside of text in various parts of a message. A literal can appear as a key value, as the operand of a literal-expression, or in the value of an option. A literal MAY include any Unicode code point except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.

All code points are preserved.

Important

Most text, including that produced by common keyboards and input methods, is already encoded in the canonical form known as Unicode Normalization Form C ("NFC"). A few languages, legacy character encoding conversions, or operating environments can result in literal values that are not in this form. Some uses of literals in MessageFormat, notably as the value of keys, apply NFC to the literal value during processing or comparison. While there is no requirement that the literal value actually be entered in a normalized form, users are cautioned to employ the same character sequences for equivalent values and, whenever possible, ensure literals are in NFC.

A quoted literal begins and ends with U+007C VERTICAL LINE |. The characters \ and | within a quoted literal MUST be escaped as \\ and \|.

An unquoted literal is a literal that does not require the | quotes around it to be distinct from the rest of the message syntax. An unquoted literal MAY be used when the content of the literal contains no whitespace and otherwise matches the unquoted production. Implementations MUST NOT distinguish between quoted literals and unquoted literals that have the same sequence of code points.

Unquoted literals can contain a name or consist of a number-literal. A number-literal uses the same syntax as JSON and is intended for the encoding of number values in operands or options, or as keys for variants.

literal          = quoted-literal / unquoted-literal
quoted-literal   = "|" *(quoted-char / escaped-char) "|"
unquoted-literal = name / number-literal
number-literal   = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]
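
The number-literal production maps closely onto a regular expression. The following Python sketch is one way to write it; the names `NUMBER_LITERAL` and `is_number_literal` are hypothetical.

```python
import re

# number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT]
#                  [%i"e" ["-" / "+"] 1*DIGIT]
NUMBER_LITERAL = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")


def is_number_literal(text: str) -> bool:
    # fullmatch ensures the whole string is consumed by the production.
    return NUMBER_LITERAL.fullmatch(text) is not None
```

As in JSON, a leading +, a leading zero followed by more digits, and a bare decimal point are all rejected: `is_number_literal("-12.50")` is true, while `is_number_literal("01")`, `is_number_literal("+1")`, and `is_number_literal("1.")` are false.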

Names and Identifiers

A name is a character sequence used in an identifier or as the name for a variable or the value of an unquoted literal.

A name can be preceded or followed by bidirectional marks or isolating controls to aid in presenting names that contain right-to-left or neutral characters. These characters are not part of the value of the name and MUST be treated as if they were not present when matching name or identifier strings or unquoted literal values.

Variable names are prefixed with $.

Two names are considered equal if they are canonically equivalent strings, that is, if they consist of the same sequence of Unicode code points after Unicode Normalization Form C ("NFC") has been applied to both.

Note

Implementations are not required to normalize all names. Comparisons of name values only need be done "as-if" normalization has occurred. Since most text in the wild is already in NFC and since checking for NFC is fast and efficient, implementations can often substitute checking for actually applying normalization to name values.
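
One way to apply this "check first" approach is sketched below in Python, using the standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). The helper names `to_nfc` and `names_equal` are hypothetical.

```python
import unicodedata


def to_nfc(value: str) -> str:
    # Checking for NFC is cheap, so only normalize when the check fails.
    if unicodedata.is_normalized("NFC", value):
        return value
    return unicodedata.normalize("NFC", value)


def names_equal(a: str, b: str) -> bool:
    """Compare two names as if both had been normalized to NFC."""
    return to_nfc(a) == to_nfc(b)
```

With this, a name written with a precomposed U+00E9 compares equal to the same name written with e followed by U+0301 COMBINING ACUTE ACCENT.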

Valid content for names is based on Namespaces in XML 1.0's NCName. This is different from XML's Name in that it MUST NOT contain a U+003A COLON :. Otherwise, the set of characters allowed in a name is large.

Note

External variables can be passed in that are not valid names. Such variables cannot be referenced in a message, but are not otherwise errors.

An identifier is a character sequence that identifies a function, markup, or option. Each identifier consists of a name optionally preceded by a namespace. When present, the namespace is separated from the name by a U+003A COLON :. Built-in functions and their options do not have a namespace identifier.

The namespace u (U+0075 LATIN SMALL LETTER U) is reserved for future standardization.

Function identifiers are prefixed with :. Markup identifiers are prefixed with # or /. Option identifiers have no prefix.

Examples:

A variable:

This has a {$variable}

A function:

This has a {:function}

An add-on function from the icu namespace:

This has a {:icu:function}

An option and an add-on option:

This has {:options option=value icu:option=add_on}

Support for namespaces and their interpretation is implementation-defined in this release.

variable   = "$" name
option     = identifier o "=" o (literal / variable)

identifier = [namespace ":"] name
namespace  = name
name       = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char  = name-start / DIGIT / "-" / "."
           / %xB7 / %x300-36F / %x203F-2040
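
The name production can be checked with simple range tables. The Python sketch below is illustrative; `name_value` is a hypothetical helper that returns the name with its optional leading and trailing bidi mark removed, or `None` when the input does not match the production.

```python
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F),            # ALPHA / "_"
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x61B), (0x61D, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFC), (0x10000, 0xEFFFF),
]
NAME_CHAR_EXTRA_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E),            # DIGIT / "-" / "."
    (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]
BIDI = "\u061C\u200E\u200F\u2066\u2067\u2068\u2069"


def _in_ranges(ch, ranges):
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def name_value(text: str):
    """Return the name with optional bidi marks removed, or None if invalid."""
    # name = [bidi] name-start *name-char [bidi]
    if text[:1] and text[0] in BIDI:
        text = text[1:]
    if text[-1:] and text[-1] in BIDI:
        text = text[:-1]
    if not text or not _in_ranges(text[0], NAME_START_RANGES):
        return None
    for ch in text[1:]:
        if not _in_ranges(ch, NAME_START_RANGES + NAME_CHAR_EXTRA_RANGES):
            return None
    return text
```

Since the bidi marks are not part of the value, `name_value("\u200Ecount\u200E")` returns `"count"`. An identifier with a namespace, such as `ns:name`, is not itself a name and has to be split at the colon first.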

Escape Sequences

An escape sequence is a two-character sequence starting with U+005C REVERSE SOLIDUS \.

An escape sequence allows the appearance of lexically meaningful characters in the body of text or quoted literal sequences. Each escape sequence represents the literal character immediately following the initial \.

escaped-char = backslash ( backslash / "{" / "|" / "}" )
backslash    = %x5C ; U+005C REVERSE SOLIDUS "\"

Note

The escaped-char rule allows escaping some characters in places where they do not need to be escaped, such as braces in a quoted literal. For example, |foo {bar}| and |foo \{bar\}| are synonymous.

When writing or generating a message, escape sequences SHOULD NOT be used unless required by the syntax. That is, apart from \ itself, escape only | inside quoted literals and only { and } inside patterns.
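
A serializer and parser might handle escape sequences along the lines of the following Python sketch. All three function names are hypothetical. The escaping helpers escape only what the syntax requires, and they do not address the restriction on the first character of a simple message.

```python
def escape_pattern_text(text: str) -> str:
    """Escape text for use inside a pattern: only \\, { and } need it."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def escape_quoted_literal(value: str) -> str:
    """Wrap a literal value in | quotes, escaping only \\ and |."""
    return "|" + value.replace("\\", "\\\\").replace("|", "\\|") + "|"


def unescape(body: str) -> str:
    """Resolve escape sequences in the body of a pattern or quoted literal."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body) or body[i + 1] not in "\\{|}":
                raise ValueError("invalid escape sequence")
            out.append(body[i + 1])  # the character after the backslash
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

The backslash is replaced first in both escaping helpers so that the backslashes introduced for the other characters are not doubled.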

Whitespace

The syntax limits whitespace characters outside of a pattern to the following: U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (new line), U+000D CARRIAGE RETURN, U+3000 IDEOGRAPHIC SPACE, or U+0020 SPACE.

Inside patterns and quoted literals, whitespace is part of the content and is recorded and stored verbatim. Whitespace is not significant outside translatable text, except where required by the syntax.

There are two whitespace productions in the syntax. Optional whitespace is whitespace that is not required by the syntax, but which users might want to include to increase the readability of a message. Required whitespace is whitespace that is required by the syntax.

Both types of whitespace optionally permit the use of the bidirectional isolate controls and certain strongly directional marks. These can assist users in presenting messages that contain right-to-left text, literals, or names (including those for functions, options, option values, and keys).

Messages that contain right-to-left (aka RTL) characters SHOULD use one of the following mechanisms to make messages display intelligibly in plain-text editors:

  1. Use paired isolating bidi controls U+2066 LEFT-TO-RIGHT ISOLATE ("LRI") and U+2069 POP DIRECTIONAL ISOLATE ("PDI") as permitted by the ABNF around parts of any message containing RTL characters:
    • inside of placeholder markers { and }
    • outside quoted-pattern markers {{ and }}
    • outside of variable, function, markup, or attribute, including the identifying sigil (e.g. <LRI>$var</PDI> or <LRI>:ns:name</PDI>)
  2. Use the 'local-effect' bidi marks U+061C ARABIC LETTER MARK, U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK as permitted by the ABNF before or after identifiers, names, unquoted literals, or option values, especially when the values contain a mix of neutral, weakly directional, and strongly directional characters.

Important

Always take care not to add bidirectional controls or marks where they would be semantically significant or where they would unintentionally become part of the message's output:

  • do not put them inside of a literal except when they are part of the value, (instead put them outside of literal quotes, such as <LRM>|...|<LRM>)
  • do not put them inside quoted patterns except when they are part of the text, (instead put them outside of quoted patterns, such as <LRI>{{...}}<PDI>)
  • do not put them outside placeholders, (instead put them inside the placeholder, such as {<LRI>$foo :number<PDI>})

Controls placed inside literal quotes or quoted patterns are part of the literal or pattern. Controls in a pattern will appear in the output of the message. Controls inside literal quotes are part of the literal and will be considered in operations such as matching a key to a selector.

Note

Users cannot be expected to create or manage bidirectional controls or marks in messages, since the characters are invisible and can be difficult to manage. Tools (such as resource editors or translation editors) and other implementations of MessageFormat 2 serialization are strongly encouraged to provide paired isolates around any right-to-left syntax as described above so that messages display appropriately as plain text.

These definitions of whitespace implement UAX#31 Requirement R3a-2. It is a profile of R3a-1 in that specification because:

  • The following pattern whitespace characters are not allowed: U+000B VERTICAL TABULATION, U+000C FORM FEED, U+0085 NEXT LINE, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR.
  • The character U+3000 IDEOGRAPHIC SPACE is interpreted as whitespace.
  • The following directional marks and isolates are treated as ignorable format controls: U+061C ARABIC LETTER MARK, U+200E LEFT-TO-RIGHT MARK, U+200F RIGHT-TO-LEFT MARK, U+2066 LEFT-TO-RIGHT ISOLATE, U+2067 RIGHT-TO-LEFT ISOLATE, U+2068 FIRST STRONG ISOLATE, and U+2069 POP DIRECTIONAL ISOLATE. (The character U+061C is an addition according to R3a.)

Note

The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for compatibility with certain East Asian keyboards and input methods, in which users might accidentally create these characters in a message.

; Required whitespace
s = *bidi ws o

; Optional whitespace
o = *(s / bidi)

; Bidirectional marks and isolates
; ALM / LRM / RLM / LRI, RLI, FSI & PDI
bidi = %x061C / %x200E / %x200F / %x2066-2069

; Whitespace characters
ws = SP / HTAB / CR / LF / %x3000
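
Because both productions range over the same small set of code points, they reduce to simple regular expressions: `o` is any run of whitespace and bidi characters, and `s` is such a run that contains at least one ws character. The Python sketch below encodes this reading; the names are hypothetical.

```python
import re

WS = r"\t\n\r \u3000"
BIDI = r"\u061C\u200E\u200F\u2066-\u2069"

# s = *bidi ws o   and   o = *(s / bidi)
REQUIRED_WS = re.compile(f"[{BIDI}]*[{WS}][{WS}{BIDI}]*")
OPTIONAL_WS = re.compile(f"[{WS}{BIDI}]*")


def is_required_whitespace(text: str) -> bool:
    return REQUIRED_WS.fullmatch(text) is not None


def is_optional_whitespace(text: str) -> bool:
    return OPTIONAL_WS.fullmatch(text) is not None
```

A lone bidi mark such as U+200E therefore satisfies `o` but not `s`, and U+000B is rejected by both.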

Complete ABNF

The grammar is formally defined in message.abnf using the ABNF notation [STD68], including the modifications found in RFC 7405.

RFC 7405 defines a variation of ABNF that is case-sensitive. Some ABNF tools are only compatible with the specification found in RFC 5234. To make message.abnf compatible with that version of ABNF, replace the rules of the same name with this block:

input = %x2E.69.6E.70.75.74  ; ".input"
local = %x2E.6C.6F.63.61.6C  ; ".local"
match = %x2E.6D.61.74.63.68  ; ".match"
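
The replacement rules spell each keyword as a sequence of hexadecimal code point values, which RFC 5234 matches exactly. As a cross-check, the short Python sketch below regenerates them from the keyword strings; the helper name `case_sensitive_rule` is hypothetical.

```python
def case_sensitive_rule(name: str, keyword: str) -> str:
    """Spell a keyword as an RFC 5234 hexadecimal terminal sequence."""
    hex_values = ".".join(f"{ord(ch):02X}" for ch in keyword)
    return f'{name} = %x{hex_values}  ; "{keyword}"'


for rule_name in ("input", "local", "match"):
    print(case_sensitive_rule(rule_name, "." + rule_name))
```

Running this prints the same three rules as the block above.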