-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
i18n: handle string placeholder collisions #14432
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,11 +12,11 @@ The collection and translation pipeline: | |
``` | ||
Source files: Locale files: | ||
+---------------------------+ +---------------------------------------------- | ||
| ++ | shared/localization/locales/en-US.json | | ||
| const UIStrings = { ... };|-+ +---> | shared/localization/locales/en-XL.json | | ||
| ++ | shared/localization/locales/en-US.json | | ||
| const UIStrings = { ... };|-+ +---> | shared/localization/locales/en-XL.json | | ||
| |-| | +----------------------------------------------+ | ||
+-----------------------------| | | || | ||
+----------------------------| | | shared/localization/locales/*.json |-<+ | ||
+----------------------------| | | shared/localization/locales/*.json |-<+ | ||
+---------------------------+ | | || | | ||
| | +----------------------------------------------| | | ||
$ yarn | | +---------------------------------------------+ | | ||
|
@@ -198,11 +198,11 @@ CTC is a name that is distinct and identifies this as the Chrome translation for | |
```json | ||
{ | ||
"name": { | ||
"message": "Message text, with optional placeholders.", | ||
"message": "Message text, with optional placeholders, which can be $PLACEHOLDER_TEXT$", | ||
"description": "Translator-aimed description of the message.", | ||
"meaning": "Description given when a message is duplicated, in order to give context to the message. Lighthouse uses a copy of the description for this.", | ||
"placeholders": { | ||
"placeholder_name": { | ||
"PLACEHOLDER_TEXT": { | ||
"content": "A string to be placed within the message.", | ||
"example": "Translator-aimed example of the placeholder string." | ||
}, | ||
|
@@ -211,6 +211,23 @@ CTC is a name that is distinct and identifies this as the Chrome translation for | |
} | ||
``` | ||
|
||
### Collisions | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. history has shown we pretty reliably forget this information over time, so please give any feedback on ambiguity so we can clear it up now :) |
||
Collisions happen when two CTC messages have the same `message`. For Lighthouse, there are two relevant collision types in TC: | ||
- Allowed: the CTC `message`, `description`, and `placeholders` are exactly the same. These collisions are deduped on the TC side and the translation cost is the same as for a single string. | ||
- Disallowed: `message` is the same but one or more of the other properties differ. | ||
|
||
When the `message` needs to be the same as another string but another property must differ, that disallowed collision can be fixed by adding a unique `meaning` property to each colliding CTC message. TC will then consider those as separate strings and not a collision. | ||
|
||
In Lighthouse, this is done by having a different `description` for the strings, which is then copied to `meaning` in `resolveMessageCollisions()`. `meaning` cannot be manually set. | ||
|
||
For instance, the string "Potential Savings" currently refers to both saved KiB and saved milliseconds in different audits. The string is defined twice, each with a different `description` describing the units being saved, in case some locales' translations will use a different word choice depending on the unit. | ||
|
||
Internally, TC uses a message ID that's a hash of `message` and `meaning` to check for collisions. Somewhat confusingly, if two messages do have colliding IDs, then `message`, `meaning`, `description`, and `placeholders` are all required to match or an error is thrown. This is why all message properties could cause a collision but `meaning` is the only way to dedupe them. | ||
|
||
We treat it as an error if `placeholders` differ between messages in a collision: if there is a need for placeholders to differ, then the strings aren't really the same, and at least the `description` should be changed to explain that context. Placeholders must match in user-controlled data (e.g. if a placeholder has an `@example`, it must be the same example in all instances) and in Lighthouse-controlled data (e.g. the token used to replace it in the CTC `message`, like `$PLACEHOLDER_TEXT$` in the example above). | ||
|
||
Finally, identical messages made to not collide by Lighthouse with a `meaning` cost real money and shouldn't be confused with allowed collisions which cost nothing for each additional collision. Fixed collisions are checked against a known list to add a little friction and motivate keeping them few in number. An error is thrown if a collision is fixed that hasn't yet been added to that list. | ||
|
||
# Appendix | ||
|
||
## Appendix A: How runtime string replacement works | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ import {expect} from 'expect'; | |
import tsc from 'typescript'; | ||
import MessageParser from 'intl-messageformat-parser'; | ||
import esMain from 'es-main'; | ||
import isDeepEqual from 'lodash/isEqual.js'; | ||
|
||
import {Util} from '../../util.cjs'; | ||
import {collectAndBakeCtcStrings} from './bake-ctc-to-lhl.js'; | ||
|
@@ -613,43 +614,90 @@ function writeStringsToCtcFiles(locale, strings) { | |
} | ||
|
||
/** | ||
* This function does two things: | ||
* | ||
* - Add `meaning` property to ctc messages that have the same message but different descriptions so TC can disambiguate. | ||
* - Throw if the known collisions has changed at all. | ||
* TODO: replace by Array.prototype.groupToMap when available. | ||
* @template {unknown} T | ||
* @template {string} U | ||
* @param {Array<T>} elements | ||
* @param {(element: T) => U} callback | ||
* @return {Map<U, Array<T>>} | ||
*/ | ||
function arrayGroupToMap(elements, callback) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Had to do this twice (nested) and was getting kind of gross. It's a nice API (for Node 20 or whatever) |
||
/** @type {Map<U, Array<T>>} */ | ||
const elementsByValue = new Map(); | ||
|
||
for (const element of elements) { | ||
const key = callback(element); | ||
const group = elementsByValue.get(key) || []; | ||
group.push(element); | ||
elementsByValue.set(key, group); | ||
} | ||
|
||
return elementsByValue; | ||
} | ||
|
||
/** | ||
* Returns whether all the placeholders in the given CtcMessages match. | ||
* @param {Array<{ctc: CtcMessage}>} strings | ||
* @return {boolean} | ||
*/ | ||
function doPlaceholdersMatch(strings) { | ||
// Technically placeholder `content` is not required to match by TC, but since | ||
// `example` must match and any auto-generated `example` is copied from `content`, | ||
// it would be confusing to let it differ when `example` is explicit. | ||
return strings.every(val => isDeepEqual(val.ctc.placeholders, strings[0].ctc.placeholders)); | ||
} | ||
|
||
/** | ||
* If two or more messages have the same `message`: | ||
* - if `description`, and `placeholders` are also the same, TC treats them as a | ||
* single message, the collision is allowed, and no change is made. | ||
* - if `description` is unique for the collision members, the `description` is | ||
* copied to `meaning` to differentiate them for TC. They must be added to the | ||
* `collidingMessages` list for this case as a nudge to not be wasteful. | ||
* - if `description` is the same but `placeholders` differ, an error is thrown. | ||
* Either `placeholders` should be the same, or `message` or `description` | ||
* should be changed to explain the need for different `placeholders`. | ||
* | ||
* Modifies `strings` in place to fix collisions. | ||
* @see https://github.com/GoogleChrome/lighthouse/blob/main/core/lib/i18n/README.md#collisions | ||
* @param {Record<string, CtcMessage>} strings | ||
*/ | ||
function resolveMessageCollisions(strings) { | ||
/** @type {Map<string, Array<CtcMessage>>} */ | ||
const stringsByMessage = new Map(); | ||
|
||
// Group all the strings by their message. | ||
for (const ctc of Object.values(strings)) { | ||
const collisions = stringsByMessage.get(ctc.message) || []; | ||
collisions.push(ctc); | ||
stringsByMessage.set(ctc.message, collisions); | ||
} | ||
const entries = Object.entries(strings).map(([key, ctc]) => ({key, ctc})); | ||
|
||
/** @type {Array<CtcMessage>} */ | ||
const allCollisions = []; | ||
const stringsByMessage = arrayGroupToMap(entries, (entry) => entry.ctc.message); | ||
for (const messageGroup of stringsByMessage.values()) { | ||
// If this message didn't collide with anything else, skip it. | ||
if (messageGroup.length <= 1) continue; | ||
|
||
// If group shares both message and description, they can be translated as if a single string. | ||
const descriptions = new Set(messageGroup.map(ctc => ctc.description)); | ||
if (descriptions.size <= 1) continue; | ||
// For each group of matching message+description, placeholders must also match. | ||
const stringsByDescription = arrayGroupToMap(messageGroup, (entry) => entry.ctc.description); | ||
for (const descriptionGroup of stringsByDescription.values()) { | ||
if (!doPlaceholdersMatch(descriptionGroup)) { | ||
throw new Error('string collision: `message` and `description` are the same but differ in `placeholders`:\n' + | ||
descriptionGroup.map(entry => `key: ${entry.key}\n`).join('') + | ||
'make `placeholders` match or update `message` or `description`s to explain why they don\'t match'); | ||
} | ||
} | ||
|
||
// If the entire message group has the same description and placeholders, | ||
// they can be translated as if a single message, no meaning needed. | ||
if (stringsByDescription.size <= 1) continue; | ||
|
||
// We have duplicate messages with different descriptions. Disambiguate using `meaning` for TC. | ||
for (const ctc of messageGroup) { | ||
for (const {ctc} of messageGroup) { | ||
ctc.meaning = ctc.description; | ||
} | ||
allCollisions.push(...messageGroup); | ||
} | ||
} | ||
|
||
// Check that the known collisions match our known list. | ||
const collidingMessages = allCollisions.map(collision => collision.message).sort(); | ||
/** | ||
* Check that fixed collisions match our known list. | ||
* @param {Record<string, CtcMessage>} strings | ||
*/ | ||
function checkKnownFixedCollisions(strings) { | ||
const fixedCollisions = Object.values(strings).filter(string => string.meaning); | ||
const collidingMessages = fixedCollisions.map(collision => collision.message).sort(); | ||
|
||
try { | ||
expect(collidingMessages).toEqual([ | ||
|
@@ -696,6 +744,7 @@ async function main() { | |
} | ||
|
||
resolveMessageCollisions(strings); | ||
checkKnownFixedCollisions(strings); | ||
|
||
writeStringsToCtcFiles('en-US', strings); | ||
console.log('Written to disk!', 'en-US.ctc.json'); | ||
|
@@ -734,4 +783,5 @@ export { | |
parseUIStrings, | ||
createPsuedoLocaleStrings, | ||
convertMessageToCtc, | ||
resolveMessageCollisions, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we mark tooltip/meter with backticks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took that from the axe docs (e.g. "Aria tooltip elements must have discernible text..."), though to be fair, their docs language selector has only one other language and that particular string stays in english :)
I kind of justified it in my head as it's an aria role, but also a description of the type of element, so it makes some sense to include a localized name.
tooltip
(andmeter
foraria-meter-name
) are backticked later in the string 🤷edit: I could make the second reference more explicitly the markup version,
Learn how to name `role="tooltip" ` elements