Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Tokenizer to treat Markdown code as text instead of HTML #1

Merged
merged 2 commits into from
Feb 3, 2018

Conversation

danielbrzn
Copy link

@danielbrzn danielbrzn commented Jan 25, 2018

This fix allows Markdown code to contain '<' , '<=' without having it affect other HTML elements as it is now treated as a text element. Furthermore, no spaces are required when typing these symbols within the back ticks.

As such, inequalities like the above can be rendered normally as shown below.

image

Resolves MarkBind/markbind#101

lib/Tokenizer.js Outdated
@@ -144,10 +144,12 @@ function Tokenizer(options, cbs){
this._ended = false;
this._xmlMode = !!(options && options.xmlMode);
this._decodeEntities = !!(options && options.decodeEntities);
this._isMarkdownCode = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tabs vs spaces 😨

Let's be consistent with the rest of the file (tabs).

lib/Tokenizer.js Outdated
@@ -635,6 +637,9 @@ Tokenizer.prototype.write = function(chunk){
Tokenizer.prototype._parse = function(){
while(this._index < this._buffer.length && this._running){
var c = this._buffer.charAt(this._index);
// Detect Markdown code so that it is parsed as text instead of HTML
if (c === '`')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No space before opening parentheses 😢

Let's be consistent with L152 of the file (braces even for single line of code).

@acjh
Copy link
Collaborator

acjh commented Jan 25, 2018

Off-topic: Add a white space around operators :)

x < y
x <= y

We don't have a JS coding standard but:

@danielbrzn
Copy link
Author

danielbrzn commented Jan 25, 2018

Updated with the requested changes, somehow my WebStorm was set to indent with spaces and I didn't manage to catch the difference in the editor.

Thanks for the tip about the white space!

lib/Tokenizer.js Outdated
}

Tokenizer.prototype._stateText = function(c){
if(c === "<"){
// parse open tags if it is not Markdown
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Parse (capital P)
  • tag (singular)

lib/Tokenizer.js Outdated
@@ -635,6 +637,10 @@ Tokenizer.prototype.write = function(chunk){
Tokenizer.prototype._parse = function(){
while(this._index < this._buffer.length && this._running){
var c = this._buffer.charAt(this._index);
// Detect Markdown code so that it is parsed as text instead of HTML
if (c === '`') {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No spaces before/after parentheses.

- Allows Markdown code to contain '<' , '<=' without having it affect other HTML elements
@acjh acjh requested a review from Gisonrg January 26, 2018 04:00
@acjh
Copy link
Collaborator

acjh commented Jan 26, 2018

We should treat <␣ and <= as text as well:

  • a < b
  • a <= b

This is reasonable: create a .html file with the above and open it in your browser (tested in Chrome).

@danielbrzn
Copy link
Author

danielbrzn commented Jan 26, 2018

Seems like the first case is handled fine. I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify

x <= y will get beautified to x <=y. I've tried using this fix mentioned here but it doesn't work. I'd reckon that the beautifier thinks <= is a valid open tag. Any ideas on how I could fix this?

@acjh
Copy link
Collaborator

acjh commented Jan 26, 2018

I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify

Can you commit and push, so we can attempt to repro?

Any ideas on how I could fix this?

Try updating js-beautify from 1.6.12 to 1.7.5 and see if the problem still exists.

@danielbrzn
Copy link
Author

js-beautify is at version 1.7.5 and the problem still persists unfortunately.

image

@acjh
Copy link
Collaborator

acjh commented Jan 26, 2018

No repro:

@danielbrzn
Copy link
Author

Are you generating the site from a index.md or a index.html? I get the bug when it's a html file, but not when it's an md file.

@acjh
Copy link
Collaborator

acjh commented Jan 27, 2018

Ah, I see that I suggested to "create a .html file" to see how the browser treats those strings.
Repro-ed when using markbind build with a .html file.

We don't have to solve that in this PR since:

  • it works with .md files which we're primarily concerned with,
  • it doesn't break anything, and
  • it's not caused by bad code in this PR.

So it's partial support for .html files: Given "a <= b", this PR gives "a <=b" instead of just "a".

lib/Tokenizer.js Outdated
@@ -160,6 +163,9 @@ Tokenizer.prototype._stateText = function(c){
this._baseState = TEXT;
this._state = BEFORE_ENTITY;
this._sectionStart = this._index;
} else if(this._isInequality){
// Next character should be parsed normally
this._isInequality = !this._isInequality;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be this._isInequality = false; since it's not a toggle.

lib/Tokenizer.js Outdated
}

Tokenizer.prototype._stateText = function(c){
if(c === "<"){
// Parse open tag if it is not Markdown and not part of an inequality
if(c === "<" && !this._isMarkdownCode && !this._isInequality){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is && !this._isInequality necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such that the Tokenizer doesn't think that the < of a <= is the start of an open HTML tag.

lib/Tokenizer.js Outdated
} else if(c === '<'){
var nextChar = this._buffer.charAt(this._index + 1);
if(nextChar === '='){
this._isInequality = !this._isInequality;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be this._isInequality = true;?

lib/Tokenizer.js Outdated
@@ -144,10 +144,13 @@ function Tokenizer(options, cbs){
this._ended = false;
this._xmlMode = !!(options && options.xmlMode);
this._decodeEntities = !!(options && options.decodeEntities);
this._isMarkdownCode = false;
this._isInequality = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reorder just these 2 in alphabetical order.

lib/Tokenizer.js Outdated
@@ -160,6 +163,9 @@ Tokenizer.prototype._stateText = function(c){
this._baseState = TEXT;
this._state = BEFORE_ENTITY;
this._sectionStart = this._index;
} else if(this._isInequality){
// Next character should be parsed normally
this._isInequality = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be the first if condition?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's the first if condition, this._isInequality would be set to false and then < would then be treated as a valid open tag

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't enter the else if block though?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, yes that's right. Will resolve it ASAP.

lib/Tokenizer.js Outdated
this._isMarkdownCode = !this._isMarkdownCode;
} else if(c === '<'){
var nextChar = this._buffer.charAt(this._index + 1);
if(nextChar === '='){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This needs a comment for consistency.
  • Index should also be checked: if(c === '<' && this._index + 1 < this._buffer.length){

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about the index, will do so.

Should the comment be inside the else if block or outside of it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be inside if you add a section name.

lib/Tokenizer.js Outdated
if(nextChar === '='){
this._isInequality = true;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Add a newline before and after this entire block.
  • Maybe add a section name like the ones below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By section name, do you mean changing '=' into something like EQUALS?

if(nextChar === EQUALS){
    this._isInequality = true;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would special conditions be an appropriate section name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for now.

lib/Tokenizer.js Outdated
if(this._isInequality){
// Next character will be parsed normally
this._isInequality = false;
} else if(c === "<" && !this._isMarkdownCode && !this._isInequality){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& !this._isInequality should be removed.

lib/Tokenizer.js Outdated
* special conditions
*/
if(c === '`'){
// Detect Markdown code to be parsed as text
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detect Toggle

lib/Tokenizer.js Outdated
this._isMarkdownCode = !this._isMarkdownCode;
} else if(c === '<' && this._index + 1 < this._buffer.length){
var nextChar = this._buffer.charAt(this._index + 1);
// Detect '<=' inequality to be parsed as text
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detect Set

Also, move this comment into the if block.

@danielbrzn
Copy link
Author

danielbrzn commented Feb 1, 2018

Made the necessary changes.

lib/Tokenizer.js Outdated
this._isInequality = true;
}
}

Copy link
Collaborator

@acjh acjh Feb 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this still looks out-of-place.

Let's introduce a new state MARKDOWN instead of tracking this._isMarkdownCode and this._isInequality.

Near top of file:

MARKDOWN                  = i++,
TEXT                      = i++, // No change

In this function:

if(this._state === MARKDOWN) {
	this._stateMarkdown(c);
} else if (this.state === TEXT) {
	this._stateText(c); // No change

Other functions:

Tokenizer.prototype._stateMarkdown = function(c){
	if(c === '`'){
		this._state = TEXT;
	}
}

Tokenizer.prototype._stateText = function(c){
	if(c === '`'){
		this._state = MARKDOWN;
	} else if(c === "<"){
		let isInequality = (this._index + 1 < this._buffer.length) && this._buffer.charAt(this._index + 1) === '=';
		if(!isInequality){
			if(this._index > this._sectionStart){
				this._cbs.ontext(this._getSection());
			}
			this._state = BEFORE_TAG_NAME;
			this._sectionStart = this._index;
		}
	}
}

lib/Tokenizer.js Outdated
@@ -6,7 +6,8 @@ var decodeCodePoint = require("entities/lib/decode_codepoint.js"),
xmlMap = require("entities/maps/xml.json"),

i = 0,


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove whitespace.

lib/Tokenizer.js Outdated
} else if(c === "<"){
var isInequality = (this._index + 1 < this._buffer.length) && (this._buffer.charAt(this._index + 1) === '=');
if(!isInequality){
if (this._index > this._sectionStart) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib/Tokenizer.js Outdated
@@ -6,7 +6,7 @@ var decodeCodePoint = require("entities/lib/decode_codepoint.js"),
xmlMap = require("entities/maps/xml.json"),

i = 0,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore newline (without whitespace).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added whitespace again 😕

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, fixed it now.

@Gisonrg
Copy link

Gisonrg commented Feb 2, 2018

Have we test the code block (```) case?

@danielbrzn
Copy link
Author

Do you mean whether code block cases render as before?

image

Just tried this out, seems to be fine. Is there something else I should test?

In the current version of the CS2103 website however, this fix will cause the rest of the page to not render as intended as there's an extra backtick; specifically in this page under the code snippet where it says
//Solution below adpated from https://stackoverflow.com/a/16252290`

If this backtick is removed, the page renders as per normal.

@damithc
Copy link

damithc commented Feb 3, 2018

In the current version of the CS2103 website however, this fix will cause the rest of the page to not render as intended as there's an extra backtick;

Removed the extra backtick.

Copy link

@Gisonrg Gisonrg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work :P

@acjh acjh merged commit 815b507 into MarkBind:master Feb 3, 2018
acjh added a commit to acjh/markbind that referenced this pull request Dec 7, 2019
Let's patch Tokenizer to treat Markdown code as text instead of HTML.

From MarkBind/htmlparser2#1:

> This fix allows Markdown code to contain '<' , '<=' without having it
> affect other HTML elements as it is now treated as a text element.
> Furthermore, no spaces are required when typing these symbols within
> the back ticks.
>
> As such, inequalities like the above can be rendered normally as
> shown below.
>
> `x<y`
> `<`
> `<=`
> `x<=y`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code snippets: text in angle brackets become lowercase
4 participants