Template node has incorrect range if the template contains emoji or other multi-byte characters 💩 #45

gitKrystan · 2023-12-05T05:17:15Z

version: content-tag 1.1.2

In investigating the root cause of ember-tooling/prettier-plugin-ember-template-tag#191 I discovered that content-tag is returning incorrect ranges when templates include multi-byte characters, such as emoji.

Reproduction:

For a 4-byte character:

import { Preprocessor } from 'content-tag';

const code = `import Component from '@glimmer/component';

class PooComponent extends Component {
  <template>💩</template>
}
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

templateNodes[0].type
// 'class-member'

templateNodes[0].contents;
// '💩'

templateNodes[0].range
// {start: 86, end: 111}

code.slice(templateNodes[0].range.start, templateNodes[0].range.end)
// '<template>💩</template>
// }'

code.slice(templateNodes[0].endRange.start, templateNodes[0].endRange.end)
// 'template>
// }'

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '💩</'

Note that the range has gobbled up the following character(s).

Similarly, for a two-byte character:

import { Preprocessor } from 'content-tag';

const code = `import Component from '@glimmer/component';

class PoundComponent extends Component {
  <template>£</template>
}
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '£<'

Interestingly, it gobbles fewer characters this time.
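A likely explanation for the difference, sketched under the assumption that the ranges are UTF-8 byte offsets while `String.prototype.slice` indexes UTF-16 code units: each character adds as many extra positions as its UTF-8 byte count exceeds its UTF-16 length. The `drift` helper below is hypothetical and exists only for illustration.

```javascript
// Compare UTF-8 byte length with JavaScript string length (UTF-16 code units).
const encoder = new TextEncoder();

// Hypothetical helper: how many extra positions a byte-based range
// covers when it is applied to a JavaScript string.
function drift(text) {
  const utf8Bytes = encoder.encode(text).length;
  const utf16Units = text.length;
  return { utf8Bytes, utf16Units, extra: utf8Bytes - utf16Units };
}

console.log(drift('💩')); // { utf8Bytes: 4, utf16Units: 2, extra: 2 }
console.log(drift('£')); // { utf8Bytes: 2, utf16Units: 1, extra: 1 }
```

This matches the slices above: two extra characters (`</`) after the emoji, one (`<`) after the pound sign.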

In the expression position, the issue is less noticeable, but still there:

import { Preprocessor } from 'content-tag';

const code = `<template>💩</template>
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

templateNodes[0].type
// 'expression'

templateNodes[0].contents;
// '💩'

templateNodes[0].range
// {start: 0, end: 25}

code.slice(templateNodes[0].range.start, templateNodes[0].range.end)
// '<template>💩</template>
// '

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '💩</'


NullVoxPopuli · 2023-12-14T21:26:24Z

Failing test PR here: #53

chancancode · 2023-12-14T23:35:38Z

The spans/offsets provided are byte offsets into the UTF-8 source, not character offsets (JavaScript string indices count UTF-16 code units), hence the discrepancy. It’s not obviously incorrect: it’s easier to grab a slice of a file with byte ranges than with character ranges, and there may be good reasons for swc to make that choice (what do source maps use?).

it should also be possible to convert the byte ranges into character ranges in the consumer if that is desirable

NullVoxPopuli · 2023-12-14T23:41:53Z

> it should also be possible to convert the byte ranges into character ranges in the consumer if that is desirable

Yeah, at a minimum we'll need to provide a byteToCharRange function, or docs or something, I think.
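For illustration, here is a minimal sketch of what such a conversion could look like in a consumer. The name `byteToCharRange` is borrowed from the comment above; the implementation is an assumption and not part of content-tag. It assumes the range holds UTF-8 byte offsets that fall on character boundaries, and it returns indices usable with `String.prototype.slice` (UTF-16 code units).

```javascript
const encoder = new TextEncoder();
// ignoreBOM: true keeps a leading byte order mark in the decoded output,
// so decoded lengths stay aligned with the original string.
const decoder = new TextDecoder('utf-8', { ignoreBOM: true });

// Hypothetical helper: convert a UTF-8 byte range into a JavaScript
// string index range by decoding the bytes that precede each offset.
function byteToCharRange(source, range) {
  const bytes = encoder.encode(source);
  const toIndex = (byteOffset) =>
    decoder.decode(bytes.subarray(0, byteOffset)).length;
  return { start: toIndex(range.start), end: toIndex(range.end) };
}

const source = '<template>💩</template>\n';
// Byte range copied from the expression-position report above.
const byteRange = { start: 0, end: 25 };
const charRange = byteToCharRange(source, byteRange);

console.log(charRange); // { start: 0, end: 23 }
console.log(source.slice(charRange.start, charRange.end));
// '<template>💩</template>'
```

Encoding the whole source on every call is wasteful for large files; a real helper would probably encode once and reuse the bytes.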

NullVoxPopuli · 2023-12-15T00:01:02Z

I think the actual problem may be related to class vs template-only, rather than multi-byte strings.

A demo of the problem: https://runkit.com/nullvoxpopuli/content-tag-byte-vs-char-offsets

it matters what's before and after the <template>-tag


I learned about Array.from as a way to get around multi-byte issues here: https://www.acuriousanimal.com/blog/20211205/javascript-handle-unicode

Oh, but the behavior I'm seeing is probably because slice is forgiving: it just gives me the whole rest of the string if I pass an end index that goes beyond the end.
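A small sketch of that clamping behavior, using a made-up input string rather than parser output:

```javascript
// String.prototype.slice clamps out-of-range indices instead of throwing.
const text = '<template>£</template>';

// The string has 22 UTF-16 code units, so an end index of 100 is clamped to 22.
const pastEnd = text.slice(10, 100);
console.log(pastEnd); // '£</template>'

// A start index past the end yields an empty string.
const empty = text.slice(50, 100);
console.log(JSON.stringify(empty)); // '""'
```

This is why an oversized byte-based range can go unnoticed when the template sits at the end of the file: the slice silently stops at the end of the string.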

NullVoxPopuli · 2023-12-15T00:27:33Z

Aahhhh ha! wasn't so bad:

Using Buffer.from, I was able to use Rust's byte indices to map correctly.

code

const { Preprocessor } = require("content-tag");

const p = new Preprocessor();

let before = `
import Component from '@glimmer/component';
import { on } from '@ember/modifier';

import { getSnippetElement, toClipboard, withExtraStyles } from './copy-utils';
import Menu from './menu';

/**
 * This component is injected via the markdown rendering
 */
export default class CopyMenu extends Component {
  copyAsText = (event: Event) => {
    let code = getSnippetElement(event);

    navigator.clipboard.writeText(code.innerText);
  };

  copyAsImage = async (event: Event) => {
    let code = getSnippetElement(event);

    await withExtraStyles(code, () => toClipboard(code));
  };

`;
let open = `<template>`;
let content = `안녕하세요 세계`
let close = `</template>`;
let after = `
}
`;

let contentLength = content.length;
let openLength = open.length;
let closeLength = close.length;

function runAndPrint(input) {
    let output = p.parse(input);
    let r = output[0];
    let range = JSON.stringify(r.range);
    let sliced = input.slice(r.range.start, r.range.end);
    let rLength = r.range.end - r.range.start;
    let asArray = Array.from(input);
    let arraySliced = asArray.slice(r.range.start, r.range.end).join('');

    let buffer = Buffer.from(input, 'utf8');
    // Slice the UTF-8 bytes directly, then decode the result back to a string
    let bufferSliced = buffer.subarray(r.range.start, r.range.end).toString('utf8');

    console.log(`
results:       ${output.length}
range:         ${range}
range length:  ${rLength}
slice:         [${sliced}]
sliced length: ${sliced.length}
array:         [${arraySliced}]
array length:  ${arraySliced.length}
buffer:        [${bufferSliced}]
buffer length: ${bufferSliced.length}
    `);
}


runAndPrint(`${before}${open}${content}${close}${after}`);
runAndPrint(`${open}${content}${close}`);
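The core of the Buffer approach can be reduced to a few lines. The sketch below does not call content-tag; the byte offsets are computed by hand with `Buffer.byteLength` and stand in for the offsets a parser would report, which is an assumption made to keep the example self-contained.

```javascript
const open = '<template>';
const content = '안녕하세요 세계';
const close = '</template>';
const input = `${open}${content}${close}`;

// Stand-in for parser output: UTF-8 byte offsets of the template contents.
const start = Buffer.byteLength(open, 'utf8');
const end = start + Buffer.byteLength(content, 'utf8');

// Slicing the encoded bytes honors byte offsets.
const fromBytes = Buffer.from(input, 'utf8').subarray(start, end).toString('utf8');
console.log(fromBytes); // '안녕하세요 세계'

// Slicing the string with the same numbers overshoots, because each
// Hangul syllable is 3 bytes in UTF-8 but 1 UTF-16 code unit.
const fromString = input.slice(start, end);
console.log(fromString); // '안녕하세요 세계</template>'
```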


gitKrystan mentioned this issue Dec 5, 2023

[Bug] 2.0.0 Errors when components contain multi-byte characters (e.g. emoji) 💩 (not always, though) ember-tooling/prettier-plugin-ember-template-tag#191

Closed

gitKrystan changed the title ~~Template node has incorrect range if the template contains emoji 💩~~ Template node has incorrect range if the template contains emoji or other multi-byte characters 💩 Dec 5, 2023

NullVoxPopuli added a commit that referenced this issue Dec 14, 2023

Reproduce issue described in #45

88cea34

NullVoxPopuli mentioned this issue Dec 18, 2023

Use content-tag instead of ember-template-imports typed-ember/glint#655

Merged

gitKrystan added a commit to gitKrystan/content-tag that referenced this issue Dec 19, 2023

Document API methods (Closes embroider-build#45)

a12d57b

gitKrystan added a commit to gitKrystan/content-tag that referenced this issue Dec 19, 2023

Document API methods (Closes embroider-build#45)

1659b17

gitKrystan added a commit to gitKrystan/content-tag that referenced this issue Dec 19, 2023

Document API methods (Closes embroider-build#45)

dc380f3

ef4 closed this as completed in 164bb66 Dec 19, 2023

ef4 added a commit that referenced this issue Dec 19, 2023

Merge pull request #57 from gitKrystan/api-docs

384c86b

Document API methods (Closes #45)

NullVoxPopuli mentioned this issue Dec 21, 2023

[ENHANCEMENT] Use content-tag to parse GTS in blueprints ember-cli/ember-cli#10418

Merged

NullVoxPopuli mentioned this issue Dec 29, 2023

fix rust utf8 ranges vs js utf16 ranges ember-tooling/ember-eslint-parser#20

Merged

github-actions bot mentioned this issue Jan 22, 2024

Prepare Release #58

Merged

NullVoxPopuli mentioned this issue Jan 22, 2024

GJS parsed as JSX typed-ember/glint#694

Closed

NullVoxPopuli mentioned this issue Mar 11, 2024

add transform callback to options so that tooling authors can customize the transform and still have correct parse / process results without a bunch of index math / conversions #75

Draft

5 tasks

NullVoxPopuli mentioned this issue Aug 15, 2024

Files with unicode in a non-last component breaks Glint typed-ember/glint#756

Open
