< parsing bug #306

Closed
eldiablolives opened this issue Dec 9, 2014 · 7 comments

Comments

@eldiablolives

The parser will just hang and explode without any message when it encounters a literal < in text (not &lt;).

So, for example, if you have something like this in your HTML text:
..some html...
hello
< world
.. some html...

it will crash with no notification.

I'm testing it on Node.js.

@kangax
Owner

kangax commented Dec 9, 2014

Probably because it tries to parse it as a start tag. Better error notifications are on the radar (we're relying on an old-ish but heavily modified HTML parser by John Resig). Please feel free to contribute.

@eldiablolives
Author

Yeah, I figured. I think it's due to not being able to find a closing >, and then other tags come in and the whole thing explodes (but silently, which is in fact the problem). I'd contribute some code but I'm late on my delivery. I got the system to run, by the way, by removing all the <'s. The problem is that if it ever bombs along the line, I won't have a way to know. I'd suggest that instead of returning the parsed text as var res = myFunc(), you do var res = myFunc(..., callback -> (err, html)). Then you're backward compatible and ready for error handling (even an ugly one), just as long as it doesn't explode.
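
A rough sketch of that suggestion, assuming the minifier is a plain synchronous function. The names `withCallback`, `fakeMinify` and `safeMinify` are made up for illustration and are not part of the library. Note that this can only surface errors that are actually thrown; it cannot rescue a parser that hangs in a loop.

```javascript
// Hypothetical wrapper: `minifyFn` stands in for any synchronous minifier.
// It keeps the return value (backward compatible) and, when a callback is
// supplied, reports the outcome Node-style as (err, html).
function withCallback(minifyFn) {
  return function (html, options, callback) {
    let result;
    try {
      result = minifyFn(html, options);
    } catch (err) {
      // Without a callback, behave exactly as before and rethrow.
      if (typeof callback !== 'function') throw err;
      callback(err);
      return undefined;
    }
    if (typeof callback === 'function') callback(null, result);
    return result;
  };
}

// Stand-in minifier for the demo: throws on a "<" followed by whitespace.
const fakeMinify = (html) => {
  if (/<\s/.test(html)) throw new Error('Parse error: stray "<"');
  return html.trim();
};

const safeMinify = withCallback(fakeMinify);

safeMinify('hello\n< world', {}, (err, html) => {
  if (err) console.error('minify failed:', err.message);
  else console.log(html);
});
```

Callers that ignore the callback keep getting the return value, so existing code would not need to change.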

@kangax
Owner

kangax commented Dec 9, 2014

Well, technically <, >, etc. should simply be escaped. That's always the best way to go (it's just plain better for compatibility among any HTML environments, parsers, etc.)
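
As a minimal illustration of escaping text before it goes into markup, something along these lines would do. `escapeHtml` is a hypothetical helper written for this example, not a function from the library, and it is only meant for text content, not for strings that already contain intended tags.

```javascript
// Illustrative helper: replace characters that HTML parsers treat as
// markup with their entity equivalents.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// The stray "<" from the report becomes an entity the parser can skip over.
console.log(escapeHtml('hello\n< world'));
```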

@eldiablolives
Author

That's a theological debate. Of course it should, and if you're doing simple HTML pages that's easy to do. But if you're doing a server script that includes dozens of templates, then you add some DB content to the mix and pull in external HTML-ready content (syndication), you get a hodgepodge of HTML. That's the key reason why one would want it all minimised, as it looks like a dog's dinner once it's all built and riddled with comments, spaces and other crap. We can't just assume coders (like me) are not idiots; that's wishful thinking on the other end, as all programs should be able to exit gracefully and handle their own shit. I think your library is the big daddy. It does the job amazingly: I've got 50ms/page end to end (which includes EJS, DB and other back-end nonsense), and that's pretty good in my book on my MacBook Air. I promise one day I'll sit down and write a decent HTML parser (I've been promising that to myself for the last 10+ years lol).

@kangax
Owner

kangax commented Dec 9, 2014

Yeah, I know how hairy it could get. We'll definitely make those errors more descriptive; hopefully sooner than in 10 years :P

@fregante
Contributor

fregante commented Jul 1, 2015

Similar to #375

Instead of allowing invalid HTML and prompting a visit from Zalgo, if you expect possibly-invalid HTML files, maybe pass them through something like HTML Tidy first.

@kangax
Owner

kangax commented Jul 3, 2015

Duplicate of #332

kangax closed this as completed on Jul 3, 2015