Translation Helper API

Vladimir Schneider edited this page Oct 19, 2019 · 12 revisions

The Formatter doubles as a translation helper: it extracts translatable text spans from the document and replaces non-translating text with identifiers so that translation does not change it.

ℹ️ In versions prior to 0.60.0, formatter functionality was implemented in the flexmark-formatter module and required an additional dependency.

The assumption for the process and format of the extracted text is that the translation process will not change the markup elements consisting of *~()[]{}<># characters. Non-translating text is replaced with placeholder text _#_ where # is an integer used to identify the original text of the placeholder.

The translation process was tested with Yandex.Translate, which does an excellent job of preserving markdown markup during translation.

Translation process

The translation process is handled in several steps:

  1. Parse the document to get markdown AST, this is normal flexmark-java markdown parsing.

  2. Format the document to get markdown strings for translation. The document node from step 1 is used with purpose set to RenderPurpose.TRANSLATION_SPANS.

  3. Get the strings to be translated from translation handler.

  4. Translate the strings by your translation service of preference.

  5. Set the translated strings in the translation handler.

  6. Generate markdown with placeholders for non-translating strings and out-of-context translations. The document node from step 1 is used with purpose set to RenderPurpose.TRANSLATED_SPANS.

  7. Parse the document with placeholders. This is normal flexmark-java markdown parsing done on the document text returned from step 6.

  8. Generate the final translated markdown with all non-translating placeholders replaced by original text and translating placeholders by their translated text, document node from step 7 is used with purpose set to RenderPurpose.TRANSLATED.

The extracted text runs are classified into three different types:

  1. Translating Spans - these are paragraphs, heading text, table cells and other stretches of text which can contain inline text elements: bold, italic and other custom elements such as strike-through, inserted, deleted, etc. The inline code element is excluded and considered non-translating, preserving its text as is.
  2. Non-Translating snippets - all text which should not be translated, such as link URIs, identifiers, HTML blocks, inline HTML tags, etc.
  3. Translating snippets - these are text parts of other elements such as link text, image alt, reference link id, reference definition id. This translatable text is translated as a separate element outside the context of its container text.

For example:

Paragraph text with embedded link [Example Link](http://example.com) in it.

Although the link text appears inside a translating text span, it should not be translated as part of it because the translator can erroneously use its context to change the translation. The same element appearing in a different textual context would result in a different translation.

To eliminate such effects, the text Example Link will be replaced in the paragraph for translation by its placeholder _1_ and its text provided as a separate translatable string. The non-translating URL will be replaced by _2_ placeholder and excluded from the translating text list.

In this example, the extracted translating text strings will be:

  1. Example Link
  2. Paragraph text with embedded link [_1_](_2_) in it.

If the following example translations are provided to the translation handler:

  1. eXaAmpLeE liINK
  2. paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.

Generation of the document at step 6 of the translation process will result in:

paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.

Parsing this markdown text in step 7 and generating the final document with placeholder replacement in step 8 will result in the translated document:

paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [eXaAmpLeE liINK](http://example.com) iIN iIt.
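
The round trip in this example can be modeled without the library. The following is a minimal sketch, assuming only inline links of the form [text](url) and the default _#_ placeholder format. The class LinkPlaceholderSketch and its methods are hypothetical names for illustration; this is not how flexmark-java implements the process, since the real formatter works on the parsed AST rather than on regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone model of the placeholder round trip; it does not call flexmark-java.
class LinkPlaceholderSketch {
    // Naive inline link matcher: [text](url).
    private static final Pattern LINK = Pattern.compile("\\[([^\\]]*)\\]\\(([^)]*)\\)");
    // Placeholder in the default "_%d_" format.
    private static final Pattern PLACEHOLDER = Pattern.compile("_(\\d+)_");

    // Index i holds the replacement text for placeholder id i + 1.
    private final List<String> replacements = new ArrayList<>();
    // Ids of placeholders whose text is translated out of context.
    private final List<Integer> translatingIds = new ArrayList<>();

    private String newPlaceholder(String original, boolean translating) {
        replacements.add(original);
        int id = replacements.size();
        if (translating) translatingIds.add(id);
        return String.format("_%d_", id);
    }

    // Models steps 2 and 3: returns the strings to hand to the translator,
    // out-of-context snippets first, the paragraph with placeholders last.
    List<String> extract(String paragraph) {
        Matcher m = LINK.matcher(paragraph);
        StringBuffer span = new StringBuffer();
        while (m.find()) {
            String text = newPlaceholder(m.group(1), true);  // link text: translating snippet
            String url = newPlaceholder(m.group(2), false);  // URL: non-translating snippet
            m.appendReplacement(span, Matcher.quoteReplacement("[" + text + "](" + url + ")"));
        }
        m.appendTail(span);
        List<String> texts = new ArrayList<>();
        for (int id : translatingIds) texts.add(replacements.get(id - 1));
        texts.add(span.toString());
        return texts;
    }

    // Models steps 5 to 8: stores the snippet translations, then expands the
    // placeholders in the translated paragraph (the last entry of the list).
    String assemble(List<String> translated) {
        for (int i = 0; i < translatingIds.size(); i++) {
            replacements.set(translatingIds.get(i) - 1, translated.get(i));
        }
        Matcher m = PLACEHOLDER.matcher(translated.get(translated.size() - 1));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int id = Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(replacements.get(id - 1)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LinkPlaceholderSketch sketch = new LinkPlaceholderSketch();
        System.out.println(sketch.extract(
                "Paragraph text with embedded link [Example Link](http://example.com) in it."));
        List<String> translated = new ArrayList<>();
        translated.add("eXaAmpLeE liINK");
        translated.add("paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.");
        System.out.println(sketch.assemble(translated));
    }
}
```

The model keeps one replacement table for both kinds of placeholder, so the final pass does not need to know whether an id stands for translated or original text.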

A translator usage example is included in the flexmark-java-samples module, in TranslationSample.java.

Implementation Details

Translation assistance is provided by Formatter.translationRender() methods which take the same arguments as Formatter.render() with two additional arguments: TranslationHandler, RenderPurpose.

RenderPurpose sets the purpose of the translation rendering:

  • RenderPurpose.FORMAT - regular format, same as the Formatter.render methods
  • RenderPurpose.TRANSLATION_SPANS - extract translating text spans from the document and identify non-translating text spans
  • RenderPurpose.TRANSLATED_SPANS - replace translating text spans with translated corresponding text.
  • RenderPurpose.TRANSLATED - replace placeholder text with translated or original text depending on the placeholder.

TranslationHandler tracks translating and non-translating spans and stores information between renderer invocations. The default implementation can be customized or completely replaced.

The difficulty in the translation process is to ensure that intermediate text with placeholders results in text which will be recognized as the original markdown element which produced the placeholder. For this the parser is modified to recognize placeholders as valid elements.

For example, an HTML block element is replaced with a single <___#_> where # is the integer placeholder ordinal position. Normally this is not a valid HTML block tag, but for purposes of translation the parser will recognize it as such. Similarly, inline HTML elements are replaced with <__#_> and auto-link URLs with <____#_>.

Other caveats include reference block element ids and their references, which must have matching placeholders for proper markdown parsing. Otherwise they will not resolve and will not produce the AST needed for placeholder replacement.

One such caveat relates to anchor refs which refer to heading elements and which are defined by the heading text. The translation process through the formatter will replace any anchor references in links to headings in the same document with new anchor refs, generated from the translated heading text.
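
To see why heading anchors need this treatment, consider a minimal sketch with a made-up slug rule (lowercase, drop punctuation, spaces to dashes). The slug function and the retarget helper are hypothetical; flexmark-java's own heading id generation may follow different rules.

```java
import java.util.Locale;

// Hypothetical slug rule for illustration; flexmark-java's own id generator may differ.
class AnchorRefSketch {
    static String slug(String headingText) {
        String lower = headingText.trim().toLowerCase(Locale.ROOT);
        // Drop everything except letters, digits, spaces and dashes, then dash the spaces.
        return lower.replaceAll("[^\\p{L}\\p{Nd} -]", "").replaceAll(" +", "-");
    }

    // Rewrites a same-document link target when its heading text has been translated.
    static String retarget(String anchorRef, String originalHeading, String translatedHeading) {
        return anchorRef.equals("#" + slug(originalHeading))
                ? "#" + slug(translatedHeading)
                : anchorRef;
    }

    public static void main(String[] args) {
        System.out.println(
                retarget("#translation-process", "Translation process", "Processus de traduction"));
    }
}
```

Without the rewrite, the link would still point at the id derived from the untranslated heading, which no longer exists in the translated document.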

The most complex handling of reference consistency exists in EnumeratedReferenceNodeFormatter.java and AttributesNodeFormatter.java, where each enumerated reference consists of two parts, category:id. Both parts need to be consistent because the category part of the reference can also be used without the id part.

To help customize the placeholder format, recognition of these placeholders by the parser and exclusion of non-translating text snippets, the following options are available in the Formatter:

  • TRANSLATION_ID_FORMAT, default "_%d_", format used for String.format(format, placeholderId) to convert an integer id into a text placeholder.
  • TRANSLATION_HTML_BLOCK_PREFIX, default "__", characters prefixed to placeholder text to distinguish an HTML block tag from the HTML inline block tags and other non-translating text during translation formatting.
  • TRANSLATION_HTML_INLINE_PREFIX, default "_", characters prefixed to placeholder text to distinguish an HTML inline tag from the HTML block tags and other non-translating text during translation formatting.
  • TRANSLATION_AUTOLINK_PREFIX, default "___", characters prefixed to placeholder text to distinguish an auto-link placeholder tag from the HTML block tags, HTML inline tags and other non-translating text during translation formatting.
  • TRANSLATION_EXCLUDE_PATTERN, default "^[^\\p{IsAlphabetic}]*$", pattern to exclude any translating strings which match the pattern. The default will exclude any which do not contain any unicode alphabetic character group.
  • TRANSLATION_HTML_BLOCK_TAG_PATTERN, default "___(?:\\d+)_", parser pattern used to recognize HTML block tags which contain translation placeholders.
  • TRANSLATION_HTML_INLINE_TAG_PATTERN, default "__(?:\\d+)_", parser pattern used to recognize HTML inline tags which contain translation placeholders.
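
The defaults listed above can be checked against each other with plain java.util.regex. This is a minimal sketch, assuming that a tag placeholder is simply the prefix followed by the formatted id (an inference from the <___#_> and <__#_> forms described earlier, not a statement about the library internals). TranslationOptionsSketch, tagName and isExcluded are hypothetical names.

```java
import java.util.regex.Pattern;

// Stand-alone sketch built from the documented defaults; it does not call flexmark-java.
class TranslationOptionsSketch {
    static final String ID_FORMAT = "_%d_";       // TRANSLATION_ID_FORMAT
    static final String HTML_BLOCK_PREFIX = "__"; // TRANSLATION_HTML_BLOCK_PREFIX
    static final String HTML_INLINE_PREFIX = "_"; // TRANSLATION_HTML_INLINE_PREFIX
    static final Pattern EXCLUDE = Pattern.compile("^[^\\p{IsAlphabetic}]*$");
    static final Pattern HTML_BLOCK_TAG = Pattern.compile("___(?:\\d+)_");
    static final Pattern HTML_INLINE_TAG = Pattern.compile("__(?:\\d+)_");

    // Assumption: a tag placeholder is the prefix followed by the formatted id.
    static String tagName(String prefix, int id) {
        return prefix + String.format(ID_FORMAT, id);
    }

    // True when the string has no alphabetic character and is left out of translation.
    static boolean isExcluded(String text) {
        return EXCLUDE.matcher(text).matches();
    }

    public static void main(String[] args) {
        String block = tagName(HTML_BLOCK_PREFIX, 7);
        String inline = tagName(HTML_INLINE_PREFIX, 7);
        System.out.println("<" + block + "> block tag: " + HTML_BLOCK_TAG.matcher(block).matches());
        System.out.println("<" + inline + "> inline tag: " + HTML_INLINE_TAG.matcher(inline).matches());
        System.out.println("2019-10-19 excluded: " + isExcluded("2019-10-19"));
        System.out.println("Example Link excluded: " + isExcluded("Example Link"));
    }
}
```

If TRANSLATION_ID_FORMAT or a prefix is changed, the two tag patterns presumably need to change with it so that the parser still recognizes the generated placeholders.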

Custom Element Translation

Custom elements which contain neither identifiers nor non-translating text need no changes, since by default all text nodes are treated as translating spans.

Custom elements that do contain such text or reference identifiers require a NodeFormatter implementation. It renders the node using the translation API methods of the MarkdownWriter, which is used for appending formatted markdown:

  • MarkdownWriter.appendNonTranslating(CharSequence) - will render a non-translating text snippet, depending on the rendering purpose it either takes text to be replaced with a placeholder, takes a placeholder and passes it through as is, or replaces the placeholder with original text.
  • MarkdownWriter.appendTranslating(CharSequence) - will render a translating text snippet, depending on the rendering purpose it either takes text to be replaced with a placeholder, takes a placeholder and passes it through as is, or replaces the placeholder with translated text.
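
The behavior of the two methods can be pictured with a small stand-alone model. This is a sketch under stated assumptions: SnippetWriterSketch, its Purpose enum, setTranslation, take and renderLink are hypothetical and only borrow the method names from the list above. The mapping of behaviors to purposes is a simplification, and the TRANSLATED_SPANS pass is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of the two snippet methods; it does not call flexmark-java.
class SnippetWriterSketch {
    // TRANSLATED_SPANS is left out to keep the model small.
    enum Purpose { FORMAT, TRANSLATION_SPANS, TRANSLATED }

    private final Map<String, String> originals = new HashMap<>();    // placeholder -> original text
    private final Map<String, String> translations = new HashMap<>(); // placeholder -> translated text
    private final StringBuilder out = new StringBuilder();
    private int nextId = 1;
    Purpose purpose = Purpose.FORMAT;

    SnippetWriterSketch append(String markup) {
        out.append(markup);
        return this;
    }

    SnippetWriterSketch appendNonTranslating(String text) {
        return appendSnippet(text, false);
    }

    SnippetWriterSketch appendTranslating(String text) {
        return appendSnippet(text, true);
    }

    private SnippetWriterSketch appendSnippet(String text, boolean translating) {
        if (purpose == Purpose.TRANSLATION_SPANS) {
            // Text in, placeholder out; remember the original under its placeholder.
            String placeholder = String.format("_%d_", nextId++);
            originals.put(placeholder, text);
            out.append(placeholder);
        } else if (purpose == Purpose.TRANSLATED) {
            // Placeholder in, final text out: the translation if one applies, else the original.
            String original = originals.getOrDefault(text, text);
            out.append(translating ? translations.getOrDefault(text, original) : original);
        } else {
            out.append(text); // FORMAT: plain pass-through
        }
        return this;
    }

    void setTranslation(String placeholder, String translated) {
        translations.put(placeholder, translated);
    }

    // Returns the accumulated output and clears the buffer for the next pass.
    String take() {
        String result = out.toString();
        out.setLength(0);
        return result;
    }

    // Hypothetical node renderer for an inline link.
    static String renderLink(SnippetWriterSketch w, String text, String url) {
        return w.append("[").appendTranslating(text).append("](")
                .appendNonTranslating(url).append(")").take();
    }

    public static void main(String[] args) {
        SnippetWriterSketch w = new SnippetWriterSketch();
        w.purpose = Purpose.TRANSLATION_SPANS;
        System.out.println(renderLink(w, "Example Link", "http://example.com"));
        w.setTranslation("_1_", "eXaAmpLeE liINK");
        w.purpose = Purpose.TRANSLATED;
        System.out.println(renderLink(w, "_1_", "_2_"));
    }
}
```

The point of the design is that the node renderer is written once: the same renderLink code serves every pass, and only the writer's purpose decides what each call emits.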

Translating and non-translating text spans are handled through the NodeFormatterContext:

  • NodeFormatterContext.translatingSpan(TranslatingSpanRender) - will treat all text rendered by TranslatingSpanRender as translating. Depending on the rendering purpose will collect the text for translating, replace it with the translated text or simply pass it through to the MarkdownWriter as is.

  • NodeFormatterContext.nonTranslatingSpan(TranslatingSpanRender) - will treat all text rendered by TranslatingSpanRender as non-translating. Depending on the rendering purpose will replace the text by placeholder text, replace it with the original text or simply pass it through to the MarkdownWriter as is.
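
Span handling differs from snippet handling in that the text comes from a render callback rather than from an argument. The following stand-alone model sketches the translating case only. SpanContextSketch is a hypothetical class, a Consumer<StringBuilder> stands in for TranslatingSpanRender, and keying translations by the original span text is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Stand-alone model of span handling; it does not call flexmark-java.
class SpanContextSketch {
    enum Purpose { FORMAT, TRANSLATION_SPANS, TRANSLATED_SPANS }

    final List<String> translatingTexts = new ArrayList<>();     // collected for the translator
    final Map<String, String> translatedTexts = new HashMap<>(); // assumption: keyed by original span
    final StringBuilder out = new StringBuilder();
    Purpose purpose = Purpose.FORMAT;

    // Runs the callback against a scratch buffer, then decides what reaches the output.
    void translatingSpan(Consumer<StringBuilder> render) {
        StringBuilder scratch = new StringBuilder();
        render.accept(scratch);
        String text = scratch.toString();
        if (purpose == Purpose.TRANSLATION_SPANS) {
            translatingTexts.add(text); // collect the span for translation
            out.append(text);
        } else if (purpose == Purpose.TRANSLATED_SPANS) {
            out.append(translatedTexts.getOrDefault(text, text)); // swap in the translation
        } else {
            out.append(text); // pass through unchanged
        }
    }

    public static void main(String[] args) {
        SpanContextSketch ctx = new SpanContextSketch();
        ctx.purpose = Purpose.TRANSLATION_SPANS;
        ctx.translatingSpan(sb -> sb.append("Heading text"));
        System.out.println(ctx.translatingTexts);

        ctx.translatedTexts.put("Heading text", "heEaADiING teEXt");
        ctx.out.setLength(0);
        ctx.purpose = Purpose.TRANSLATED_SPANS;
        ctx.translatingSpan(sb -> sb.append("Heading text"));
        System.out.println(ctx.out);
    }
}
```

A non-translating span would follow the same shape, with a placeholder substituted during extraction and the original text restored in the final pass.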

For examples of how references are handled, it is best to refer to the implementation of core elements in CoreNodeFormatter.java or in the extensions: