FCP HIPE: localized messages #64

Closed
wants to merge 22 commits into from

Contributor

@dhh1128 dhh1128 commented Dec 1, 2018

Signed-off-by: Daniel Hardman [email protected]

@dhh1128 dhh1128 changed the title Initial proposal propose HIPE: localized messages Dec 1, 2018
@swcurran
Member

swcurran commented Dec 1, 2018

My first take is that I really like this. It overlaps (overlays?) with the schemas and overlays work, since a key use case there is localization, specifically as it relates to schemas. This proposal is more general, but still complementary. With VON, we've already run into the same challenge. Our first cut was the traditional approach you mention, where the UI software presenting the data does the localization, and that does not scale to our many Issuers. The Issuers need to be able to convey to the Holder the localization data for what is being issued.

The only concern I have is whether we need to focus on the trustworthiness of the catalog. I think the catalog is necessary. Is there a concern that the translation mechanism could be used to confuse users, e.g. by translating "Yes" to "No" and vice versa for an important transaction? Or is my tinfoil hat on too tight?

in a message; wherever it appears, it overrides any message catalog specified at a more general
level. The value of `@msg_catalog` is a URI (ideally, a DID reference):

[![sample5.png](sample5.png)](sample5.json)
Member

Can/should the @msg_catalog be scoped to the friendly_ltxt context to support different catalogs for multiple *_ltxt contexts within a single message?

Member

BTW - not sure I'm a fan of the images for JSON approach. Looks good, but it's hard to comment on.

Contributor

I'm a fan of using the inline markdown formatting for code examples.

Contributor Author

@TelegramSam when you say "inline markdown formatting", are you talking about just doing codeblocks with a triple backtick and a declaration of the code type--or something fancier? If fancier, I want to learn how. I tried the backticks and was very unhappy with it. I wanted syntax highlighting, and I also wanted to be able to bold or select a subset of the JSON to call it out.

@swcurran I checked in a .json file for each of the JSON graphics--so you can leave comments on the lines of the json file instead of the image.

I don't love this solution. If there's something better, please tell.

Contributor

I use triple backticks with the language right after. I use json as the language. Different markdown parsers highlight differently; some do useful things, some do nothing.

@TelegramSam
Contributor

Ideas:

  1. Declare the locale at the message level. Then all strings can be assumed to be that locale unless otherwise specified. At a minimum, this allows machine translation in the absence of more localization detail.
  2. Use the attribute string itself as a reference to the string in the msg_catalog. This works well with short strings.
  3. Include localization detail with sibling attributes, not manipulation of the main attribute.
    Example:
{
  //... normal message stuff
  "@locale":"en",
  "@msg_catalog": "<catalog uri>",
  "some_string_attribute":"This is a test",
  "some_string_attribute_loc": {
    "code": "this_is_a_test",
    "es": "<translation of 'this is a test'>"
  }
}

Example Notes:
Even without the .._loc attribute, you can look up "This is a test" in the catalog.
The sibling ..._loc attribute makes it easier for schema parsers. They will ignore it if they are not expecting it.
These changes allow the localization to be 'additive', in that you can add localization without modifying the existing attribute structures. This removes barriers to making something localized when it wasn't to start out.
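
For illustration, here is a minimal Python sketch of the lookup order these ideas imply: an inline translation in the sibling attribute first, then the catalog by code, then the catalog by the string itself. The catalog is modeled as a plain dict keyed by locale; that shape and the `resolve` helper are assumptions made for this example and are not defined anywhere in the proposal.

```python
from typing import Any, Dict

# Hypothetical in-memory catalog: {locale: {code_or_source_string: translation}}.
Catalog = Dict[str, Dict[str, str]]


def resolve(message: Dict[str, Any], field: str, want: str, catalog: Catalog) -> str:
    """Return the best available text for `field` in locale `want`."""
    text = message[field]
    # Nothing to do if the message is already in the requested locale.
    if message.get("@locale") == want:
        return text
    loc = message.get(field + "_loc", {})
    # 1. An inline translation in the sibling attribute wins.
    if want in loc:
        return loc[want]
    entries = catalog.get(want, {})
    # 2. Otherwise try the explicit code, if the sender supplied one.
    code = loc.get("code")
    if code is not None and code in entries:
        return entries[code]
    # 3. Otherwise look up the string itself, falling back to the original text.
    return entries.get(text, text)
```

Because step 3 returns the untranslated text when nothing matches, a message with no `_loc` sibling and no catalog entry still displays something sensible.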

@dhh1128
Contributor Author

dhh1128 commented Dec 7, 2018

> 1. Declare the locale at the message level.

Good improvement. I'll update.

> 2. Use the attribute string itself as a reference to the string in the msg_catalog. This works well with short strings.

I thought about this. I know it's sometimes done, but there are two drawbacks: you can't change the value of a string without invalidating its lookup in a catalog, and you have ambiguities where the same text means two different things (e.g., "control" as a verb in one place, and as a noun in another). I don't know if either of these is a big deal, but that's why I said the code was required. Maybe we make the code optional, so either lookup key could be used?

> 3. Include localization detail with sibling attributes, not manipulation of the main attribute.
>    The sibling ..._loc attribute makes it easier for schema parsers. They will ignore it if they are not expecting it.
>    These changes allow the localization to be 'additive' in that you can add localization without modifying the existing attribute structures. This removes barriers to making something localized when it wasn't to start out.

These are good points. However, if we do this, then it is no longer possible to look at a field and realize by its naming convention that it is localizable. You only know it is localizable if it has a sibling attribute. And since I expect most people who write localizable messages will make no effort to localize them (e.g., the values of the attributes will be generated dynamically), we'd end up either with no clues about which attributes to localize, or with a lot of stuff that looks like this:

"some_string_attribute":"This is a test",
"some_string_attribute_loc": {}

I am also dubious about the utility of making something localized when it wasn't to start out. My experience has been that if you weren't thinking about localization from the beginning, you usually have bigger problems than schema adjustments.

I don't know what to do about this, though--because even though I feel pretty strongly about my own reasoning, the other points raised by @TelegramSam are equally good. Is there some way we can have our cake and eat it too?

@dhh1128
Contributor Author

dhh1128 commented Dec 7, 2018

> focus on the trustworthiness of the catalog

This is a really important point, @swcurran . I'll add some text about it. Can you think of any ways to strengthen the security around it? An obvious way would be to publish the hash of the catalog so it couldn't be tampered with--but that feels like tedious, kludgey overkill... Yet the risk of hacking via the catalog is real...

@dhh1128
Contributor Author

dhh1128 commented Dec 7, 2018

@TelegramSam:

> Is there some way we can have our cake and eat it too?

What if we said that messages purporting to belong to a schema with required attribute "some_string_attribute", but lacking that field, could still be valid if a field named "some_string_attribute_ltxt" is present? If it is, then the latter attribute is the localized variant of "some_string_attribute" and should be interpreted as satisfying that field's place in the schema. In this way, messages could gain localization support without doing violence to a schema.

I dunno. I don't love it. Requires a parser/validator to do something quirky.
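
A minimal sketch of the "quirky" validator behavior described above, in Python. The function name and the idea of passing required fields as a simple list of names are assumptions for illustration; the thread does not specify how schemas are represented.

```python
from typing import Any, Dict, Iterable, List

LTXT_SUFFIX = "_ltxt"


def missing_required(message: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return the required field names that the message does not satisfy.

    A required field counts as present if either the plain name or its
    `_ltxt`-suffixed localized variant appears in the message.
    """
    return [
        name
        for name in required
        if name not in message and name + LTXT_SUFFIX not in message
    ]
```

An empty result means the message satisfies the schema's required fields under this relaxed rule.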

@dhh1128
Contributor Author

dhh1128 commented Dec 7, 2018

@swcurran and @TelegramSam : I updated the HIPE to address all your comments. There's a security warning and best practices around the catalog hacking issue. The _ltxt field now has a sibling field instead of being a dict. (I used _l10n instead of _loc for the suffix, because _loc is likely to be used as a short form of _location in many schemas.) The HIPE now describes how to do a lookup on a string value if no code is given. And I have also included a note about how schemas that declare a field without the _ltxt suffix can upgrade to localization support--not automatically, but by noting this feature in their message family definition docs.

@swcurran Can you make the corresponding changes in the problem_report HIPE, such that friendly_ltxt (which I've noted in a separate comment should be renamed to explain_ltxt) has a simple string value and a sibling field explain_l10n, instead of having a value that's a dict?

@swcurran
Member

swcurran commented Dec 7, 2018

> focus on the trustworthiness of the catalog
>
> This is a really important point, @swcurran . I'll add some text about it. Can you think of any ways to strengthen the security around it? An obvious way would be to publish the hash of the catalog so it couldn't be tampered with--but that feels like tedious, kludgey overkill... Yet the risk of hacking via the catalog is real...

I think there are a couple of pieces of prior art to look at here.

In the open source world, "catalogs" (as we are calling them here) evolve over time through community contributions. For example, many open source applications have releases that consist only of new translations done by community contributors. I think in this case, we have to expect the same model and we should design a system to support that. For example, a decentralized way to extend and rate (approve?) a translation.

The Schemas and Overlays group is planning to have Schema Overlays (metadata associated with a Schema) that (in some cases) are for localization on the ledger. I don't know how close that plan is to reality - or whether the indy-node team is comfortable with it. While the catalogs we are discussing are tied to messages vs. schema - would it be worthwhile to have them on the ledger - or at least immutable? Then the message receiver could be notified of what immutable message overlays are available and which ones should be used. I think it's doable, but it's complicated and adds a bunch of state for agents to track...

AFAIK, with, say, Python applications, given a catalog (in the case of an app, that's in the codebase), all strings to be presented are first checked for a mapping to the localized string in the requested locale. With Sam's proposal of the catalog and locale in the message, and a user providing the desired locale to present, that should be pretty easy. Further, with the catalog and code field that should be easy as well - the code field would be used as the neutral form of the text.

@dhh1128
Contributor Author

dhh1128 commented Dec 11, 2018

@swcurran: regarding the relation to schemas and overlays, and publishing on the ledger, I would say that that's a very interesting idea, but I don't want to hold up this HIPE for it. Let's keep a bookmark in it and circle back to it when the progress there makes the link to their work easier. (Note as well that W3C just decided to change the name of the "identifier registry" in the VC spec to "verifiable data registry", exactly to accommodate stuff like this, where you must have a provably correct version of something. We don't necessarily need to publish on the ledger; we could publish anywhere if the hash of the message catalog were included in the message. But we can work that out later.)
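
As a rough sketch of the "hash of the message catalog included in the message" idea, the receiver could check a fetched catalog like this. The choice of SHA-256, the hex encoding, and the `catalog_matches` helper are all assumptions for illustration; the thread does not name a hash algorithm or say where in the message the digest would live.

```python
import hashlib
import hmac


def catalog_matches(catalog_bytes: bytes, expected_sha256_hex: str) -> bool:
    """True if the fetched catalog hashes to the digest carried in the message."""
    actual = hashlib.sha256(catalog_bytes).hexdigest()
    # compare_digest avoids leaking, via timing, how many characters matched.
    return hmac.compare_digest(actual, expected_sha256_hex.lower())
```

With a check like this, the catalog could be hosted anywhere, because a tampered copy would fail verification before any of its strings were shown to the user.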

@TelegramSam
Contributor

TelegramSam commented Dec 11, 2018

I love the _l10n suffix for a sibling decorator.

I wonder if the locale and catalog would be better organized under a structure like this:

"@l10n": {
  "locale": "en",
  "catalog": "<did ref>"
}

This feels simpler, and the @l10n matches the _l10n. This would be the first block-form @annotation, which makes me a little nervous, but it feels better than having both a @locale and a @msg_catalog.

My remaining issue is knowing which fields can be localized. It is fairly important that we articulate why not just any string should be localized: there is a security risk of accidentally sending a secret to a translation service.
Already articulated are a few options:

  • Message Family Documentation: This one seems bad at the outset, but I don't think it's harmful at all. There are NO expectations of an unknown message family being processed. During development, the list of fields would need to be provided to the Message class to allow automatic localization. Now this is done by hand, but in the future it could be automated via formalized message family docs.
  • Field Suffix: This is Daniel's favorite, but not mine. I feel like the cases where the field suffix would be used without a sibling field would be fairly rare in the advanced case, and the simple case likely isn't going to declare the suffix at all. This also requires a breaking change to add localization if the message family wasn't designed that way to begin with.
  • Sibling Field: I think this is a better explicit option, and will likely be the most common in advanced uses due to the need for a code.
  • In Message Field List: If we use the @l10n block I suggested, we could include a list of fields in there.

We certainly shouldn't allow all of these options due to the resulting complexity.
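
To make the "In Message Field List" option concrete, here is a hedged Python sketch of how a receiver might select only the explicitly listed fields before handing anything to a translation service. The `fields` key inside the `@l10n` block is hypothetical; the comment above only suggests that such a list could be included there.

```python
from typing import Any, Dict


def translatable_fields(message: Dict[str, Any]) -> Dict[str, str]:
    """Return only the string fields the sender explicitly marked as localizable."""
    l10n = message.get("@l10n", {})
    # "fields" is a hypothetical key; the thread only suggests that a list
    # of field names could live inside the @l10n block.
    allowed = l10n.get("fields", [])
    return {
        name: message[name]
        for name in allowed
        if isinstance(message.get(name), str)
    }
```

Anything not named in the list, such as a field holding a secret, is never returned, so it cannot be sent to a translator by accident.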

It occurs to me that there is a progression of localization maturity:

  0. None at all.
  1. Stated locale. Allows machine translation. Assumes discovery (or development-time encoding) of localizable fields.
  2. Stated locale and catalog. Lookup via full field values.
  3. Stated locale and catalog, fields have explicit catalog codes in sibling fields.

The inline localizations fit in with 2 and 3 as an alternative or addition to a catalog.

I'm going to guess that many families will be prototyped and tested at level 0 or 1. If breaking changes are required to progress up the levels of localization it would cause a Major version update to the message family, which may or may not be a desirable quality.
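
The progression can be sketched as a small Python classifier, numbering the levels from 0 as the paragraph above does. It assumes the `@locale` and `@msg_catalog` decorators and the `_l10n` sibling suffix discussed earlier in this thread; the `l10n_level` function itself is purely illustrative.

```python
from typing import Any, Dict


def l10n_level(message: Dict[str, Any]) -> int:
    """Classify a message on the 0-3 localization progression."""
    if "@locale" not in message:
        return 0  # no localization information at all
    if "@msg_catalog" not in message:
        return 1  # stated locale only
    has_code = any(
        key.endswith("_l10n") and isinstance(value, dict) and "code" in value
        for key, value in message.items()
    )
    # Level 3 adds explicit catalog codes in sibling fields; level 2 looks up by value.
    return 3 if has_code else 2
```

Each step up only adds keys to the message, which is the property that would let a message family move up the levels without a breaking change.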

On catalog attack prevention: Can we sign the catalog with the key in the DID doc used to reference the catalog? Just an idea, we should handle this in a future HIPE.

@swcurran
Member

> @swcurran: regarding the relation to schemas and overlays, and publishing on the ledger, I would say that that's a very interesting idea, but I don't want to hold up this HIPE for it. Let's keep a bookmark in it and circle back to it when the progress there makes the link to their work easier. (Note as well that W3C just decided to change the name of the "identifier registry" in the VC spec to "verifiable data registry", exactly to accommodate stuff like this, where you must have a provably correct version of something. We don't necessarily need to publish on the ledger; we could publish anywhere if the hash of the message catalog were included in the message. But we can work that out later.)

Agreed - that's the right approach.

@TelegramSam
Contributor

@dhh1128 Your work on this HIPE is incredible. Thank you for your effort.

The only changes I can see that are needed are the cleanup of the last section and filling out the complex example.

@TelegramSam TelegramSam changed the title propose HIPE: localized messages FCP HIPE: localized messages Apr 3, 2019
@dhh1128
Contributor Author

dhh1128 commented May 28, 2019

This is superseded by hyperledger/aries-rfcs#43.

@dhh1128 dhh1128 closed this May 28, 2019
@dhh1128 dhh1128 deleted the localized-messages branch May 31, 2019 23:30