Update Tokenizer to treat Markdown code as text instead of HTML #1

danielbrzn · 2018-01-25T16:23:23Z

This fix allows Markdown code to contain '<' , '<=' without having it affect other HTML elements as it is now treated as a text element. Furthermore, no spaces are required when typing these symbols within the back ticks.

As such, inequalities like the above can be rendered normally as shown below.

`x<y`
`<`
`<=`
`x<=y`

Resolves MarkBind/markbind#101
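
For readers following the thread, the idea behind the fix can be sketched in isolation. This is a minimal stand-alone illustration and not the actual Tokenizer.js change: `findTagStarts` is a hypothetical helper that toggles a flag on each backtick and only reports a `<` as a possible tag start when the scanner is outside backticks.

```javascript
// Hypothetical helper (not part of htmlparser2): returns the indices at
// which a '<' would be considered the start of an HTML tag.
function findTagStarts(input){
	var inMarkdownCode = false;
	var tagStarts = [];
	for(var i = 0; i < input.length; i++){
		var c = input.charAt(i);
		if(c === "`"){
			// Each backtick flips between plain text and inline code
			inMarkdownCode = !inMarkdownCode;
		} else if(c === "<" && !inMarkdownCode){
			tagStarts.push(i);
		}
	}
	return tagStarts;
}

// The '<' inside the backticks is skipped; the two real tags are found.
console.log(findTagStarts("`x<y` <b>bold</b>"));
```

In this sketch the `<` at index 2 is ignored because it sits between backticks, while the `<` characters that open `<b>` and `</b>` are still reported.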

acjh · 2018-01-25T16:41:27Z

lib/Tokenizer.js

@@ -144,10 +144,12 @@ function Tokenizer(options, cbs){
 	this._ended = false;
 	this._xmlMode = !!(options && options.xmlMode);
 	this._decodeEntities = !!(options && options.decodeEntities);
+    this._isMarkdownCode = false;


Tabs vs spaces 😨

Let's be consistent with the rest of the file (tabs).

acjh · 2018-01-25T16:42:26Z

lib/Tokenizer.js

@@ -635,6 +637,9 @@ Tokenizer.prototype.write = function(chunk){
 Tokenizer.prototype._parse = function(){
 	while(this._index < this._buffer.length && this._running){
 		var c = this._buffer.charAt(this._index);
+		// Detect Markdown code so that it is parsed as text instead of HTML
+		if (c === '`')


No space before opening parentheses 😢

Let's be consistent with L152 of the file (braces even for single line of code).

acjh · 2018-01-25T16:51:14Z

Off-topic: Add a white space around operators :)

x < y
x <= y

We don't have a JS coding standard but:

danielbrzn · 2018-01-25T17:14:35Z

Updated with the requested changes. Somehow my WebStorm was set to indent with spaces, and I didn't catch the difference in the editor.

Thanks for the tip about the white space!

acjh · 2018-01-25T17:18:31Z

lib/Tokenizer.js

 }

 Tokenizer.prototype._stateText = function(c){
-	if(c === "<"){
+	// parse open tags if it is not Markdown


Parse (capital P)

tag (singular)

acjh · 2018-01-25T17:22:26Z

lib/Tokenizer.js

@@ -635,6 +637,10 @@ Tokenizer.prototype.write = function(chunk){
 Tokenizer.prototype._parse = function(){
 	while(this._index < this._buffer.length && this._running){
 		var c = this._buffer.charAt(this._index);
+		// Detect Markdown code so that it is parsed as text instead of HTML
+		if (c === '`') {


No spaces before/after parentheses.

- Allows Markdown code to contain '<' , '<=' without having it affect other HTML elements

acjh · 2018-01-26T06:20:03Z

We should treat <␣ and <= as text as well:

a < b
a <= b

This is reasonable: create a .html file with the above and open it in your browser (tested in Chrome).

danielbrzn · 2018-01-26T15:35:54Z

Seems like the first case is handled fine. I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify

x <= y will get beautified to x <=y. I've tried using this fix mentioned here but it doesn't work. I'd reckon that the beautifier thinks <= is a valid open tag. Any ideas on how I could fix this?

acjh · 2018-01-26T17:53:24Z

I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify

Can you commit and push, so we can attempt to repro?

Any ideas on how I could fix this?

Try updating js-beautify from 1.6.12 to 1.7.5 and see if the problem still exists.

danielbrzn · 2018-01-26T18:50:43Z

js-beautify is at version 1.7.5 and the problem still persists unfortunately.

acjh · 2018-01-26T19:06:27Z

No repro:

danielbrzn · 2018-01-27T04:56:20Z

Are you generating the site from an index.md or an index.html? I get the bug when it's an .html file, but not when it's an .md file.

acjh · 2018-01-27T05:38:11Z

Ah, I see that I suggested to "create a .html file" to see how the browser treats those strings.
Repro-ed when using markbind build with a .html file.

We don't have to solve that in this PR since:

it works with .md files which we're primarily concerned with,
it doesn't break anything, and
it's not caused by bad code in this PR.

So it's partial support for .html files: Given "a <= b", this PR gives "a <=b" instead of just "a".

acjh · 2018-01-27T05:39:58Z

lib/Tokenizer.js

@@ -160,6 +163,9 @@ Tokenizer.prototype._stateText = function(c){
 		this._baseState = TEXT;
 		this._state = BEFORE_ENTITY;
 		this._sectionStart = this._index;
+	} else if(this._isInequality){
+		// Next character should be parsed normally
+		this._isInequality = !this._isInequality;


This should be this._isInequality = false; since it's not a toggle.

acjh · 2018-01-27T05:41:01Z

lib/Tokenizer.js

 }

 Tokenizer.prototype._stateText = function(c){
-	if(c === "<"){
+	// Parse open tag if it is not Markdown and not part of an inequality
+	if(c === "<" && !this._isMarkdownCode && !this._isInequality){


Why is && !this._isInequality necessary?

This is so that the Tokenizer doesn't think that the < of a <= is the start of an open HTML tag.

acjh · 2018-01-27T05:42:55Z

lib/Tokenizer.js

+		} else if(c === '<'){
+			var nextChar = this._buffer.charAt(this._index + 1);
+			if(nextChar === '='){
+				this._isInequality = !this._isInequality;


Should this be this._isInequality = true;?

acjh · 2018-01-27T05:44:06Z

lib/Tokenizer.js

@@ -144,10 +144,13 @@ function Tokenizer(options, cbs){
 	this._ended = false;
 	this._xmlMode = !!(options && options.xmlMode);
 	this._decodeEntities = !!(options && options.decodeEntities);
+	this._isMarkdownCode = false;
+	this._isInequality = false;


Reorder just these 2 in alphabetical order.

acjh · 2018-01-27T08:47:23Z

lib/Tokenizer.js

@@ -160,6 +163,9 @@ Tokenizer.prototype._stateText = function(c){
 		this._baseState = TEXT;
 		this._state = BEFORE_ENTITY;
 		this._sectionStart = this._index;
+	} else if(this._isInequality){
+		// Next character should be parsed normally
+		this._isInequality = false;


Can this be the first if condition?

If it's the first if condition, this._isInequality would be set to false and < would then be treated as a valid open tag.

It won't enter the else if block though?

Woops, yes that's right. Will resolve it ASAP.
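
The ordering point in this thread can be checked with a simplified model. The sketch below is an assumption-laden stand-in for the flag-based approach, not the PR's code: `classify` is a hypothetical function that merges the lookahead step (done in `_parse` in the diffs above) and the state handler step (done in `_stateText`) into one loop.

```javascript
// Hypothetical, simplified model of the one-shot flag discussed above.
function classify(input){
	var isInequality = false;
	var result = [];
	for(var i = 0; i < input.length; i++){
		var c = input.charAt(i);
		// Lookahead step: mark the '<' of a '<=' before it is handled
		if(c === "<" && i + 1 < input.length && input.charAt(i + 1) === "="){
			isInequality = true;
		}
		// Handler step: the flag is checked first and reset immediately
		if(isInequality){
			isInequality = false;
			result.push("text");
		} else if(c === "<"){
			// The flag is known to be false here, so no extra check is needed
			result.push("tag");
		} else {
			result.push("text");
		}
	}
	return result;
}

console.log(classify("a<=b<i"));
```

Because the flag branch comes first and consumes the marked character, the `else if` branch is only reached when the flag is already false, so the `<` of `<=` is classified as text while the later bare `<` is still classified as a tag start.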

acjh · 2018-01-27T08:51:21Z

lib/Tokenizer.js

+			this._isMarkdownCode = !this._isMarkdownCode;
+		} else if(c === '<'){
+			var nextChar = this._buffer.charAt(this._index + 1);
+			if(nextChar === '='){


This needs a comment for consistency.

Index should also be checked: if(c === '<' && this._index + 1 < this._buffer.length){

Good point about the index, will do so.

Should the comment be inside the else if block or outside of it?

It can be inside if you add a section name.
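
A small sketch of the bounds-checked lookahead, for reference. `isInequalityAt` is a hypothetical helper used only for illustration. Note that `String.prototype.charAt` returns an empty string for an out-of-range index, so in this sketch the explicit length check mainly makes the intent visible rather than preventing an error.

```javascript
// Hypothetical helper: true if buffer has "<=" starting at index.
function isInequalityAt(buffer, index){
	return buffer.charAt(index) === "<" &&
		index + 1 < buffer.length &&
		buffer.charAt(index + 1) === "=";
}

console.log(isInequalityAt("a<=b", 1));
console.log(isInequalityAt("a<", 1));
```

One consequence of a one-character lookahead, at least in this sketch: if the input ends right after the `<`, the following `=` cannot be seen and the `<` is not recognised as part of an inequality.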

acjh · 2018-01-27T09:16:16Z

lib/Tokenizer.js

+			if(nextChar === '='){
+				this._isInequality = true;
+			}
+		}


Add a newline before and after this entire block.

Maybe add a section name like the ones below.

By section name, do you mean changing '=' into something like EQUALS?

if(nextChar === EQUALS){ this._isInequality = true; }

I mean these: https://github.com/MarkBind/htmlparser2/pull/1/files#diff-00550ec11d6b5101df5a54c5fee7cc2eR670

Would special conditions be an appropriate section name?

Fine for now.

acjh · 2018-02-01T04:37:01Z

lib/Tokenizer.js

+	if(this._isInequality){
+		// Next character will be parsed normally
+		this._isInequality = false;
+	} else if(c === "<" && !this._isMarkdownCode && !this._isInequality){


&& !this._isInequality should be removed.

acjh · 2018-02-01T04:42:07Z

lib/Tokenizer.js

+		*	special conditions
+		*/
+		if(c === '`'){
+			// Detect Markdown code to be parsed as text


~~Detect~~ Toggle

acjh · 2018-02-01T04:45:05Z

lib/Tokenizer.js

+			this._isMarkdownCode = !this._isMarkdownCode;
+		} else if(c === '<' && this._index + 1 < this._buffer.length){
+			var nextChar = this._buffer.charAt(this._index + 1);
+			// Detect '<=' inequality to be parsed as text


~~Detect~~ Set

Also, move this comment into the if block.

danielbrzn · 2018-02-01T05:15:59Z

Made the necessary changes.

acjh · 2018-02-01T06:15:14Z

lib/Tokenizer.js

+				this._isInequality = true;
+			}
+		}
+


Hmm, this still looks out-of-place.

Let's introduce a new state MARKDOWN instead of tracking this._isMarkdownCode and this._isInequality.

Near top of file:

MARKDOWN = i++,
TEXT = i++, // No change

In this function:

if(this._state === MARKDOWN) {
	this._stateMarkdown(c);
} else if (this._state === TEXT) {
	this._stateText(c);
// No change

Other functions:

Tokenizer.prototype._stateMarkdown = function(c){
	if(c === '`'){
		this._state = TEXT;
	}
}

Tokenizer.prototype._stateText = function(c){
	if(c === '`'){
		this._state = MARKDOWN;
	} else if(c === "<"){
		let isInequality = (this._index + 1 < this._buffer.length) && this._buffer.charAt(this._index + 1) === '=';
		if(!isInequality){
			if(this._index > this._sectionStart){
				this._cbs.ontext(this._getSection());
			}
			this._state = BEFORE_TAG_NAME;
			this._sectionStart = this._index;
		}
	}
}
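
To see the suggested state-based design end to end, here is a self-contained sketch. It is a simplified stand-in rather than the real tokenizer: `tokenize`, `IN_TAG`, and the token objects are hypothetical and exist only for this illustration, while `MARKDOWN`, `TEXT`, and the inequality lookahead follow the suggestion above.

```javascript
// Simplified stand-in for the tokenizer; only MARKDOWN and TEXT come from
// the review suggestion, everything else is hypothetical.
var i = 0,
	MARKDOWN = i++,
	TEXT = i++,
	IN_TAG = i++;

function tokenize(input){
	var state = TEXT, sectionStart = 0, tokens = [];
	for(var index = 0; index < input.length; index++){
		var c = input.charAt(index);
		if(state === MARKDOWN){
			// Inside backticks: only a closing backtick changes state
			if(c === "`"){ state = TEXT; }
		} else if(state === TEXT){
			if(c === "`"){
				state = MARKDOWN;
			} else if(c === "<"){
				var isInequality = (index + 1 < input.length) && input.charAt(index + 1) === "=";
				if(!isInequality){
					// Flush pending text, then start a tag section
					if(index > sectionStart){
						tokens.push({ type: "text", value: input.slice(sectionStart, index) });
					}
					state = IN_TAG;
					sectionStart = index;
				}
			}
		} else if(c === ">"){
			// IN_TAG: a '>' closes the tag
			tokens.push({ type: "tag", value: input.slice(sectionStart, index + 1) });
			state = TEXT;
			sectionStart = index + 1;
		}
	}
	// Whatever remains unflushed is emitted as text in this sketch
	if(sectionStart < input.length){
		tokens.push({ type: "text", value: input.slice(sectionStart) });
	}
	return tokens;
}

console.log(tokenize("`a<b` x<=y <i>z</i>"));
```

The appeal of this shape is that no boolean flags need to be kept in sync: whether a `<` can open a tag depends only on the current state and a single lookahead.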

acjh · 2018-02-01T16:05:08Z

lib/Tokenizer.js

@@ -6,7 +6,8 @@ var decodeCodePoint = require("entities/lib/decode_codepoint.js"),
    xmlMap    = require("entities/maps/xml.json"),

    i = 0,
-
+


Remove whitespace.

acjh · 2018-02-01T16:06:04Z

lib/Tokenizer.js

+	} else if(c === "<"){
+		var isInequality = (this._index + 1 < this._buffer.length) && (this._buffer.charAt(this._index + 1) === '=');
+		if(!isInequality){
+			if (this._index > this._sectionStart) {


No spaces before/after parentheses 😢

acjh · 2018-02-02T06:18:10Z

lib/Tokenizer.js

@@ -6,7 +6,7 @@ var decodeCodePoint = require("entities/lib/decode_codepoint.js"),
    xmlMap    = require("entities/maps/xml.json"),

    i = 0,
-


Restore newline (without whitespace).

You added whitespace again 😕

Sorry, fixed it now.

Gisonrg · 2018-02-02T14:57:06Z

Have we tested the code block (```) case?

danielbrzn · 2018-02-02T16:37:48Z

Do you mean whether code block cases render as before?

Just tried this out, seems to be fine. Is there something else I should test?

In the current version of the CS2103 website however, this fix will cause the rest of the page to not render as intended as there's an extra backtick; specifically in this page under the code snippet where it says
//Solution below adpated from https://stackoverflow.com/a/16252290`

If this backtick is removed, the page renders as per normal.
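
Both observations above are consistent with a scanner that flips its mode on every backtick. The sketch below assumes exactly that simple per-backtick toggle, and `endsInsideMarkdownCode` is a hypothetical helper rather than anything in the tokenizer.

```javascript
// Hypothetical helper: reports whether a toggle-per-backtick scanner would
// still be inside Markdown code at the end of the input.
function endsInsideMarkdownCode(input){
	var inMarkdownCode = false;
	for(var i = 0; i < input.length; i++){
		if(input.charAt(i) === "`"){
			inMarkdownCode = !inMarkdownCode;
		}
	}
	return inMarkdownCode;
}

// A fenced block contributes three backticks on each side (six in total)
var fence = "`".repeat(3);
console.log(endsInsideMarkdownCode(fence + "\nx <= y\n" + fence));

// A single stray backtick leaves the scanner in code mode
console.log(endsInsideMarkdownCode("adapted from a/16252290`\n<p>later</p>"));
```

Under this assumption, an opening fence toggles three times and so leaves the scanner in code mode until the closing fence restores it, whereas one unmatched backtick leaves every later `<` on the page treated as text.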

damithc · 2018-02-03T02:01:40Z

In the current version of the CS2103 website however, this fix will cause the rest of the page to not render as intended as there's an extra backtick;

Removed the extra backtick.

Gisonrg

Great work :P

Let's patch Tokenizer to treat Markdown code as text instead of HTML.

From MarkBind/htmlparser2#1:

> This fix allows Markdown code to contain '<' , '<=' without having it
> affect other HTML elements as it is now treated as a text element.
> Furthermore, no spaces are required when typing these symbols within
> the back ticks.
>
> As such, inequalities like the above can be rendered normally as
> shown below.
>
> `x<y`
> `<`
> `<=`
> `x<=y`

acjh requested changes Jan 25, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from c3efc5a to 28804b6 Compare January 25, 2018 17:12

acjh reviewed Jan 25, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from 28804b6 to 43fe9b5 Compare January 25, 2018 17:31

Update Tokenizer to treat Markdown code as text instead of HTML

3880c5c

- Allows Markdown code to contain '<' , '<=' without having it affect other HTML elements

danielbrzn force-pushed the markdown-parsing-fix branch from 43fe9b5 to 3880c5c Compare January 25, 2018 18:00

acjh approved these changes Jan 26, 2018

acjh requested a review from Gisonrg January 26, 2018 04:00

acjh requested changes Jan 27, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from 38ac8da to 21ddebe Compare January 27, 2018 06:21

acjh requested changes Jan 27, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from 21ddebe to ed1f971 Compare January 27, 2018 14:59

acjh requested changes Feb 1, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from ed1f971 to c09d6ee Compare February 1, 2018 05:12

danielbrzn force-pushed the markdown-parsing-fix branch from c09d6ee to f11e76a Compare February 1, 2018 05:20

acjh reviewed Feb 1, 2018

acjh added this to the v3.10.0-markbind.1 milestone Feb 1, 2018

acjh mentioned this pull request Feb 1, 2018

Update package.json to point to htmlparser2 fork MarkBind/markbind#126

Merged

danielbrzn force-pushed the markdown-parsing-fix branch 2 times, most recently from 6943bc6 to 7aecd9b Compare February 1, 2018 15:34

danielbrzn force-pushed the markdown-parsing-fix branch from 7aecd9b to 369b0ba Compare February 1, 2018 15:37

acjh requested changes Feb 1, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from 369b0ba to b348228 Compare February 1, 2018 16:41

acjh requested changes Feb 2, 2018

danielbrzn force-pushed the markdown-parsing-fix branch from b348228 to 89cde72 Compare February 2, 2018 07:18

Update Tokenizer to recognise inequalities and parse them as text

6e614fb

danielbrzn force-pushed the markdown-parsing-fix branch from 89cde72 to 6e614fb Compare February 2, 2018 14:55

acjh approved these changes Feb 3, 2018

Gisonrg approved these changes Feb 3, 2018

acjh merged commit 815b507 into MarkBind:master Feb 3, 2018

acjh mentioned this pull request Dec 7, 2019

Patch htmlparser2 instead of rely on MarkBind fork MarkBind/markbind#948

Merged

ang-zeyu mentioned this pull request Dec 30, 2020

Remove markdown - htmlparser2 patch MarkBind/markbind#1435

Merged
