Escape entities not correctly parsed/escaped #36

Open
RaminRabani opened this issue Nov 15, 2021 · 1 comment · May be fixed by #47

Comments

@RaminRabani
Contributor

Reproducing this bug first requires fixing the bug described in #35.

When the entities object is correctly passed to the WebVTTCueTextParser on line 177 as described in the issue above, there are some problems with parsing escape entities.

Steps to Reproduce

const { parse } = new WebVTTParser({
  "&amp": "&",
  "&amp;": "&",
  "&AMP;": "&",
  "&AMP": "&",
});

const text1 = `
WEBVTT

1
00:11:46.140 --> 00:11:48.380
Texas A&amp;M`

const text2 = `
WEBVTT

1
00:11:46.140 --> 00:11:48.380
Texas A&amp`

const text3 = `
WEBVTT

1
00:11:46.140 --> 00:11:48.380
Texas A&ampM`

const parsed1 = parse(text1, "metadata");
console.log(parsed1.cues[0].tree.children[0].value); // Texas A&M (correctly parsed)

const parsed2 = parse(text2, "metadata");
console.log(parsed2.cues[0].tree.children[0].value); // Texas A& (correctly parsed)

const parsed3 = parse(text3, "metadata");
console.log(parsed3.cues[0].tree.children[0].value); // Texas A&ampM (incorrectly parsed)

As you can see, when the escape sequence &amp is followed by another alphanumeric character, rather than by ; or the end of the string (undefined), it is not parsed as an entity and is left in the output unchanged.

Solution

I believe some conditional logic needs to be updated or added in lines 632-670 to account for escape entities that are followed by an alphanumeric character.
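
To make the intended behavior concrete, here is a minimal standalone sketch of lenient entity decoding. It is not the library's code: decodeEntities is a hypothetical helper, and it assumes the entity map uses semicolon-free keys such as "&amp". At each "&" it looks for the longest matching key and then consumes an optional trailing ";".

```javascript
// Hypothetical helper, not part of webvtt.js: decodes entities leniently.
// Assumes keys in `entities` have no trailing semicolon, e.g. "&amp".
function decodeEntities(text, entities) {
  // Try longer names first so a long key wins over a shorter key sharing its prefix.
  const names = Object.keys(entities).sort((a, b) => b.length - a.length);
  let result = "";
  let i = 0;
  while (i < text.length) {
    const name =
      text[i] === "&" ? names.find((n) => text.startsWith(n, i)) : undefined;
    if (name === undefined) {
      // Not the start of a known entity: copy the character through.
      result += text[i];
      i += 1;
      continue;
    }
    result += entities[name];
    i += name.length;
    // Consume an optional terminating semicolon.
    if (text[i] === ";") i += 1;
  }
  return result;
}

const lenientEntities = { "&amp": "&", "&lt": "<", "&gt": ">" };
console.log(decodeEntities("Texas A&amp;M", lenientEntities)); // Texas A&M
console.log(decodeEntities("Texas A&amp", lenientEntities)); // Texas A&
console.log(decodeEntities("Texas A&ampM", lenientEntities)); // Texas A&M
```

One trade-off of matching this leniently is that any text beginning with a known entity name is decoded, so "&ampersand" would become "&ersand". Whether that is acceptable depends on how lenient the parser is meant to be.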

@lionel-rowe

Looks like this is an issue with the parsing logic, along with the default entities list.

The default entities list includes no semicolons (presumably intended to allow more lenient parsing), so the current parsing logic for &gt; yields >; rather than >. In addition, the intended lenient parsing fails unless the entity happens to be at the end of the node, because the parser relies on hitting ";" or the end of the node to parse the entity.

function cook(raw, suffix = '') {
	const cooked = new WebVTTParser().parse(
		`WEBVTT\n\n00:00.000 --> 00:10.000\n${raw}${suffix}\n`,
		'metadata',
	).cues[0].tree.children[0].value

	return cooked.slice(0, cooked.length - suffix.length)
}

const raws = ['&', '&gt;', '&lt;', '&gt', '&lt', '&amp', '&gt']

raws.map((x) => cook(x))
// expected: ['&', '>', '<', '>', '<', '&', '>']
// actual:   ['&', '>;', '<;', '>', '<', '&', '>']

raws.map((x) => cook(x, '--'))
// expected: ['&', '>', '<', '>', '<', '&', '>']
// actual:   ['&', '>;', '<;', '&gt', '&lt', '&amp', '&gt']

I guess for many use cases the lenient parsing isn't necessary, so this can be worked around by instead passing a list of well-formed default entities to the constructor:

const raw = '&gt; &lt; &amp;'

const brokenParser = new WebVTTParser()
brokenParser.parse(`WEBVTT\n\n00:00.000 --> 00:10.000\n${raw}\n`, 'metadata').cues[0].tree.children[0].value
// result: '>; <; &;'

// same as default entities list but well-formed
const entities = {
	'&amp;': '&',
	'&lt;': '<',
	'&gt;': '>',
	'&lrm;': '\u200e',
	'&rlm;': '\u200f',
	'&nbsp;': '\u00A0',
}

const fixedParser = new WebVTTParser(entities)
fixedParser.parse(`WEBVTT\n\n00:00.000 --> 00:10.000\n${raw}\n`, 'metadata').cues[0].tree.children[0].value
// result: '> < &'
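
A well-formed list like the one above could also be derived from a semicolon-free map instead of being typed by hand. A minimal sketch, assuming a hypothetical helper withSemicolons (not part of webvtt.js) and assuming the semicolon-free map below mirrors the library defaults:

```javascript
// Hypothetical helper, not part of webvtt.js:
// returns a copy of the map in which every key ends with ";".
function withSemicolons(lenient) {
  return Object.fromEntries(
    Object.entries(lenient).map(([name, value]) => [
      name.endsWith(";") ? name : `${name};`,
      value,
    ]),
  );
}

// Assumed to mirror the default entities list (keys without semicolons).
const lenientDefaults = {
  "&amp": "&",
  "&lt": "<",
  "&gt": ">",
  "&lrm": "\u200e",
  "&rlm": "\u200f",
  "&nbsp": "\u00A0",
};

const wellFormed = withSemicolons(lenientDefaults);
console.log(Object.keys(wellFormed));
```

The result can then be passed to the constructor in the same way as the hand-written entities object.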

@lionel-rowe lionel-rowe linked a pull request Feb 26, 2024 that will close this issue