Template node has incorrect range if the template contains emoji or other multi-byte characters 💩 #45

Closed
gitKrystan opened this issue Dec 5, 2023 · 5 comments · Fixed by #58 · May be fixed by #75

Comments

@gitKrystan

gitKrystan commented Dec 5, 2023

version: content-tag 1.1.2

In investigating the root cause of ember-tooling/prettier-plugin-ember-template-tag#191 I discovered that content-tag is returning incorrect ranges when templates include multi-byte characters, such as emoji.

Reproduction:

For a 4-byte character:

import { Preprocessor } from 'content-tag';

const code = `import Component from '@glimmer/component';

class PooComponent extends Component {
  <template>💩</template>
}
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

templateNodes[0].type
// 'class-member'

templateNodes[0].contents;
// '💩'

templateNodes[0].range
// {start: 86, end: 111}

code.slice(templateNodes[0].range.start, templateNodes[0].range.end)
// '<template>💩</template>
// }'

code.slice(templateNodes[0].endRange.start, templateNodes[0].endRange.end)
// 'template>
// }'

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '💩</'

Note that the range has gobbled up the following character(s).

Similarly, for a two-byte character:

import { Preprocessor } from 'content-tag';

const code = `import Component from '@glimmer/component';

class PoundComponent extends Component {
  <template>£</template>
}
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '£<'

Interestingly, it gobbles fewer characters this time.

In the expression position, the issue is less noticeable, but still there:

import { Preprocessor } from 'content-tag';

const code = `<template>💩</template>
`;

const p = new Preprocessor();
const templateNodes = p.parse(code); // Array of length 1

templateNodes[0].type
// 'expression'

templateNodes[0].contents;
// '💩'

templateNodes[0].range
// {start: 0, end: 25}

code.slice(templateNodes[0].range.start, templateNodes[0].range.end)
// '<template>💩</template>
// '

code.slice(templateNodes[0].contentRange.start, templateNodes[0].contentRange.end)
// '💩</'
@gitKrystan gitKrystan changed the title Template node has incorrect range if the template contains emoji 💩 Template node has incorrect range if the template contains emoji or other multi-byte characters 💩 Dec 5, 2023
NullVoxPopuli added a commit that referenced this issue Dec 14, 2023
@NullVoxPopuli

Failing test PR here: #53

@chancancode

The spans/offsets provided are UTF-8 byte offsets, not character offsets, hence the discrepancy. It’s not obviously incorrect: it’s easier to grab a slice of a file with byte ranges than character ranges, and there may be good reasons for swc to make that choice (what do source maps use?).

It should also be possible to convert the byte ranges into character ranges in the consumer if that is desirable.
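
For illustration, the size of the overshoot in the reproductions above matches the gap between a character's UTF-8 byte length and its JavaScript string length (UTF-16 code units). This is a minimal Node.js sketch; `overshoot` is a hypothetical helper name, not part of content-tag.

```javascript
// Compare the UTF-8 byte length of a string with its JavaScript
// string length (counted in UTF-16 code units).
// `overshoot` is a hypothetical helper used only for this illustration.
function overshoot(text) {
  const bytes = Buffer.byteLength(text, 'utf8');
  const units = text.length;
  return { bytes, units, extra: bytes - units };
}

console.log(overshoot('💩')); // { bytes: 4, units: 2, extra: 2 }
console.log(overshoot('£')); // { bytes: 2, units: 1, extra: 1 }
```

An extra of 2 for 💩 lines up with the slice swallowing `</`, and an extra of 1 for £ lines up with it swallowing `<`.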

@NullVoxPopuli

> It should also be possible to convert the byte ranges into character ranges in the consumer if that is desirable.

Yeah, at a minimum we'll need to provide a byteToCharRange function, or docs, or something, I think.
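
A consumer-side conversion along these lines might look like the sketch below. It is an assumption about what such a helper could do, not content-tag's API: the name `byteToCharRange` comes from the comment above, while its signature and implementation are hypothetical. It also assumes both offsets fall on character boundaries.

```javascript
// Hypothetical helper (not part of content-tag): convert a UTF-8 byte range
// into a range of JavaScript string indices (UTF-16 code units).
// Assumes range.start and range.end fall on character boundaries.
function byteToCharRange(source, range) {
  const buffer = Buffer.from(source, 'utf8');
  // Decode the bytes before the range to count the code units they occupy.
  const start = buffer.toString('utf8', 0, range.start).length;
  // Decode the bytes inside the range to find its length in code units.
  const end = start + buffer.toString('utf8', range.start, range.end).length;
  return { start, end };
}

const source = '<template>💩</template>\n';
// Byte range reported in the issue for the expression-position example.
const charRange = byteToCharRange(source, { start: 0, end: 25 });
console.log(charRange); // { start: 0, end: 23 }
console.log(source.slice(charRange.start, charRange.end)); // '<template>💩</template>'
```

Encoding the whole source on every call is wasteful for large files; a real implementation would probably encode once and reuse the buffer.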

@NullVoxPopuli

NullVoxPopuli commented Dec 15, 2023

I think the actual problem may be related to class vs template-only, rather than multi-byte strings.

A demo of the problem: https://runkit.com/nullvoxpopuli/content-tag-byte-vs-char-offsets

It matters what comes before and after the <template> tag.

I learned about Array.from as a way to get around multi-byte issues here: https://www.acuriousanimal.com/blog/20211205/javascript-handle-unicode

Oh, but the behavior I'm seeing is probably because slice is forgiving and just gives me the whole rest of the string if I pass an index that goes beyond the end.
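
A small sketch of the two behaviors mentioned above, using a made-up sample string: `String.prototype.slice` clamps an out-of-range end index instead of throwing, and `Array.from` splits a string by code point rather than by UTF-16 code unit. Neither unit is a UTF-8 byte, so neither lines up with byte offsets on its own.

```javascript
// Sample string chosen for illustration: one ASCII char, one emoji, one ASCII char.
const sample = 'a💩b';

const codeUnits = sample.length; // UTF-16 code units
const codePoints = Array.from(sample).length; // Unicode code points
const utf8Bytes = Buffer.byteLength(sample, 'utf8'); // UTF-8 bytes

console.log(codeUnits, codePoints, utf8Bytes); // 4 3 6

// An end index past the end of the string is clamped to the string length,
// so an oversized byte offset silently returns "the rest of the string".
const clamped = sample.slice(1, 100);
console.log(clamped); // '💩b'
```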

@NullVoxPopuli

Aahhhh ha! wasn't so bad:

Using Buffer.from, I was able to use Rust's byte indices to map correctly.

code
const { Preprocessor } = require("content-tag");

const p = new Preprocessor();

let before = `
import Component from '@glimmer/component';
import { on } from '@ember/modifier';

import { getSnippetElement, toClipboard, withExtraStyles } from './copy-utils';
import Menu from './menu';

/**
 * This component is injected via the markdown rendering
 */
export default class CopyMenu extends Component {
  copyAsText = (event: Event) => {
    let code = getSnippetElement(event);

    navigator.clipboard.writeText(code.innerText);
  };

  copyAsImage = async (event: Event) => {
    let code = getSnippetElement(event);

    await withExtraStyles(code, () => toClipboard(code));
  };

`;
let open = `<template>`;
let content = `안녕하세요 세계`;
let close = `</template>`;
let after = `
}
`;

let contentLength = content.length;
let openLength = open.length;
let closeLength = close.length;

function runAndPrint(input) {
    let output = p.parse(input);
    let r = output[0];
    let range = JSON.stringify(r.range);
    let sliced = input.slice(r.range.start, r.range.end);
    let rLength = r.range.end - r.range.start;
    let asArray = Array.from(input);
    let arraySliced = asArray.slice(r.range.start, r.range.end).join('');

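    // Encode to UTF-8 so the byte offsets from content-tag index correctly,
    // then decode the selected bytes back into a string.
    let buffer = Buffer.from(input, 'utf8');
    let bufferSliced = buffer.subarray(r.range.start, r.range.end).toString('utf8');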

    console.log(`
results:       ${output.length}
range:         ${range}
range length:  ${rLength}
slice:         [${sliced}]
sliced length: ${sliced.length}
array:         [${arraySliced}]
array length:  ${arraySliced.length}
buffer:        [${bufferSliced}]
buffer length: ${bufferSliced.length}
    `);
}


runAndPrint(`${before}${open}${content}${close}${after}`);
runAndPrint(`${open}${content}${close}`);

gitKrystan added a commit to gitKrystan/content-tag that referenced this issue Dec 19, 2023
@ef4 ef4 closed this as completed in 164bb66 Dec 19, 2023
ef4 added a commit that referenced this issue Dec 19, 2023
Document API methods (Closes #45)