htmlescape markdown #10061

hayd · 2015-02-03T23:10:02Z

Throwing this out there, IMO html escaping is required for rendering markdown in html.

~~Note: This breaks the Ref test, so I'm not sure what I should do to fix it?~~

I need to read the Markdown spec (http://spec.commonmark.org/) a little more carefully; I'm not sure what the precise rules for escaping are. I think I'm doing a little too much here (e.g. I wonder if links need different behaviour).

This was mentioned in #5239 and #9933.

MikeInnes · 2015-02-04T21:19:33Z

This is the right idea, but it would be good to see a more efficient implementation. I think it might work nicely to have a printesc(io, s) function which checks each char before putting it into the stream.
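For readers following along, a minimal sketch of the per-character idea, written for current Julia rather than the 0.4-era code in this PR. The names `ESCAPES` and `printesc` and the choice of four characters are assumptions for illustration, not the code that was merged.

```julia
# Hypothetical sketch: `ESCAPES` and `printesc` are illustrative names,
# not the implementation from this PR.
const ESCAPES = Dict('&' => "&amp;", '<' => "&lt;", '>' => "&gt;", '"' => "&quot;")

# Write `s` to `io` one character at a time, substituting the entity for
# each special character, so no escaped copy of the string is built first.
function printesc(io::IO, s::AbstractString)
    for ch in s
        print(io, get(ESCAPES, ch, ch))
    end
end

printesc(stdout, "a < b && \"c\"")
```

Because the check happens per character on the way into the stream, the caller never allocates an intermediate escaped string.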

If &.*; forms need to be handled specially (GitHub does this at least), maybe it would be more robust to handle that case with an inline parser.

hayd · 2015-02-04T22:20:59Z

Does this mean there need to be two levels of escaping (another for &.*;) - when should these be escaped?

Agree about efficiency, this strategy was precisely what @ivarne suggested on the other issue.

ivarne · 2015-02-05T10:13:31Z

printesc is a somewhat generic name. Maybe we want a parameter to distinguish what type of escaping we want? (HTML could be default)

MikeInnes · 2015-02-05T10:30:00Z

Since this is (at least for now) only an implementation detail of Markdown.jl I'm not too worried about naming or genericity. Having an uber-generic escaping system in Base may well be a reasonable goal but I'd rather it didn't block solving this immediate problem.

(Though genericity may also not be the right approach if markdown's escaping requirements are different from other formats, for example).

hayd · 2015-02-05T10:32:59Z

presumably htmlescape(io, s) would be fine too. Is the _htmlescape_chars approach reasonable (would the current impl be efficient enough if using a stream)?

MikeInnes · 2015-02-05T10:43:53Z

Keeping the chars in a dict seems reasonable to me, the only thing is that it would be nice if the &, # and ; chars were added programmatically.

MikeInnes · 2015-02-05T10:44:40Z

base/markdown/render/html.jl

@@ -8,6 +8,20 @@ function withtag(f, io, tag)
    print(io, "</$tag>")
 end

+const _htmlescape_chars = Dict('<'=>"&lt;", '>'=>"&gt;",
+                               '"'=>"&quot;", '\''=>"&#39",


That would avoid missing-semicolon errors (the bane of a programmer's life, of course).
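One way to read the suggestion, sketched in current Julia: derive each numeric entity from the character's code point so the `&`, `#` and `;` are produced in a single place. `numeric_entity` and `NUMERIC_ESCAPES` are hypothetical names, and the character set here is only an example.

```julia
# Hypothetical sketch: `numeric_entity` and `NUMERIC_ESCAPES` are illustrative names.
# Build "&#NN;" from the code point so the '&', '#' and ';' are never typed by hand.
numeric_entity(ch::Char) = string("&#", Int(ch), ";")

# Example character set only; the PR's actual list differs.
const NUMERIC_ESCAPES = Dict(ch => numeric_entity(ch) for ch in "'`()")

println(NUMERIC_ESCAPES['\''])
```

With this shape, a typo such as a dropped trailing semicolon can only happen once, in `numeric_entity`, rather than in every table entry.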

hayd · 2015-02-11T21:24:05Z

@one-more-minute updated, is this more like what you were intending?

Note: the test still fails on the Ref test - not sure how to fix that.

hayd · 2015-02-24T08:26:20Z

Bump! Looking at this I don't see why this would affect the Ref test (which is purely on the md_str not html) :s

MikeInnes · 2015-02-25T20:34:01Z

This basically looks good but the test is concerning. Maybe try rebasing? I can look into it more thoroughly but it might be a while before I have time.

MikeInnes · 2015-02-25T20:35:38Z

base/markdown/render/html.jl

@@ -5,7 +5,7 @@ include("rich.jl")
 function withtag(f, io::IO, tag, attrs...)
    print(io, "<$tag")
    for (attr, value) in attrs
-        print(io, " $attr=\"$value\"")
+        print(io, " ", htmlescape(attr), "=\"", htmlescape(value), "\"")


Do attributes ever need to be escaped? As far as I know special characters aren't valid in attribute names anyway, and since the values are quoted they shouldn't need escaping.

They could themselves include quotes, so I think they do need escaping?
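To make the quoting concern concrete, here is a small sketch in current Julia. `escape_attr` and `open_tag` are hypothetical helpers written for this illustration; they are not the `withtag` or `htmlescape` from the diff above.

```julia
# Hypothetical sketch: `escape_attr` and `open_tag` are illustrative names.
function escape_attr(value)
    # '&' is replaced first so the entities added below are not escaped again.
    s = replace(string(value), "&" => "&amp;")
    s = replace(s, "\"" => "&quot;")
    return replace(s, "<" => "&lt;")
end

# Print an opening tag, escaping each attribute value before quoting it.
function open_tag(io::IO, tag, attrs::Pair...)
    print(io, "<", tag)
    for (attr, value) in attrs
        print(io, " ", attr, "=\"", escape_attr(value), "\"")
    end
    print(io, ">")
end

open_tag(stdout, "a", "title" => "say \"hi\"")
```

Without the escaping step the same call would print `<a title="say "hi"">`, where the embedded quote ends the attribute value early.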

hayd · 2015-02-25T20:36:36Z

@one-more-minute the test is a legitimate failure due to Ref. I'm really not sure what I'm missing in that example? :s (Edit: which is to say, I really don't see how this touches it as there is nothing html-y about that test.)

MikeInnes · 2015-02-25T20:47:29Z

Maybe try seeing if there's a difference between the html output for md"Behaves like $(ref(fft))" and md"Behaves like fft (see Julia docs)" after this change? This PR shouldn't have changed the output but it looks like it must've done.

hayd · 2015-02-25T20:51:47Z

From the travis logs:

ERROR: LoadError: LoadError: test failed:
  (Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["Behaves like ",Reference(fft)])],Dict{Any,Any}())
== Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["Behaves like fft (see Julia docs)"])],Dict{Any,Any}()))

hayd · 2015-02-25T20:54:07Z

Tbh, I don't understand how that passed before! Looks like it should be comparing plain(md...)?

MikeInnes · 2015-02-25T20:55:24Z

It passed before because the HTML output was the same for both (which is how equality is defined for MD objects). So this PR must be changing the HTML output somehow, I think.

hayd · 2015-02-25T21:20:43Z

Hmmm, it appears to also not work for plain either (when it did before):

julia> md"Behaves like $(ref(fft))"
  Behaves like Reference(fft)

julia> md"Behaves like fft (see Julia docs)"
  Behaves like fft (see Julia docs)

julia> html(md"Behaves like $(ref(fft))")
"<p>Behaves like Reference(fft)</p>\n"

julia> html(md"Behaves like fft (see Julia docs)")  # Note also brackets correctly escaped
"<p>Behaves like fft &#40;see Julia docs&#41;</p>\n"

and just the Reference itself (which should expand!):

julia> print(r)
Reference(fft)

@one-more-minute Any idea what I am missing?

hayd · 2015-02-26T00:17:35Z

@one-more-minute Ok, I have a fix for this (should pass now). Sorry, it was easier than I thought. Thanks for your direction!

There is one weird definition I noticed:

plain(io::IO, x) = tohtml(io, x)

This leads to the following if writemime for text/html is defined (as in the new test):

julia> r = ref(fft)  # plain works, as this is the fallback for terminal (I think)
fft (see Julia docs)

julia> plain(r)  # this isn't plain!
"<a href=\"test\">fft &#40;see Julia docs&#41;</a>"

Of course, you actually use plaininline so it doesn't matter here (so the test etc. works), but this definition still seems odd (when would you want it?).

MikeInnes · 2015-02-26T10:05:59Z

I'm not totally sure why that fallback is there, maybe it should be writemime.

I think it might be better to figure out what's breaking the test, rather than changing the test so it passes. I think what must be happening is that when tohtml falls back for text/plain it doesn't escape the resulting text.

MikeInnes · 2015-02-26T10:06:42Z

base/markdown/render/html.jl

+const _htmlescape_chars = Dict('<'=>"&lt;", '>'=>"&gt;",
+                               '"'=>"&quot;", # ' '=>"&nbsp;",
+                               )
+for ch in "'`!@\$\%()=+{}[]"


I'd add & here as well

hayd · 2015-02-27T02:03:57Z

@one-more-minute ok, I have a fix for these....

There was a minor issue ~~(as you can see in the tests)~~: the plain text writemime of Ref was not escaped. This is fixed, but I'm not sure what the syntax is to pipe the output of tohtml through printesc (i.e. without sprint):

printesc(io, sprint(writemime, m, x))  # here m is MIME"text/plain"

How can I do that without the sprint?

Note: I was seeing the following strange behaviour but I think it's fixed (possibly due to my misuse/redefining Ref in the repl):

julia> r = ref(fft)
Reference(fft)

julia> Markdown.bestmime(r)
MIME type text/plain

julia> Markdown.tohtml(STDOUT, Markdown.bestmime(r), r)
Reference(fft)

julia> writemime(STDOUT, Markdown.bestmime(r), r)
fft (see Julia docs)

The definition of tohtml here is to call writemime (!) so this behaviour is bizarre to me.

hayd · 2015-03-05T19:31:08Z

base/markdown/render/rich.jl

@@ -3,7 +3,7 @@ function tohtml(io::IO, m::MIME"text/html", x)
 end

 function tohtml(io::IO, m::MIME"text/plain", x)
-    writemime(io, m, x)
+    printesc(io, sprint(writemime, m, x))


@one-more-minute is there a way to do this line without sprint-ing (i.e. in one pass)?

Any other comments?

@one-more-minute sorry, I still don't understand the fix you're alluding to here/below!

@one-more-minute Another example where this kind of thing is useful is if we had newline characters in the MD (which the spec suggests), in html you want to print them as new lines BUT in the terminal as spaces (as html is rendered). Somehow:

remove_newlines(io, terminalinline(..., x))

hayd · 2015-03-12T20:28:53Z

another extension for this would be to escape latex for #10494

hayd · 2015-03-13T19:53:29Z

worth mentioning there is already a print_escaped method in Base (for escaping C and unicode escape sequences). Currently that only has one method:

print_escaped(io,s::AbstractString,esc::AbstractString) at string.jl:872

...

MikeInnes · 2015-03-14T15:50:48Z

The way to do escaping without rendering to a string first would be to use the IO wrapper approach, like the one @ivarne suggested elsewhere. That does make things a bit more complicated, unfortunately, but I think it's the only way to do it completely efficiently.
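A rough sketch of what such an IO wrapper could look like in current Julia, where `writemime(io, m, x)` has since become `show(io, m, x)`. `EscapingIO` is a hypothetical type for illustration; it is not part of Base or the Markdown code, and it is not what this PR merged.

```julia
# Hypothetical sketch: `EscapingIO` is an illustrative name, not part of Base or Markdown.
struct EscapingIO{T<:IO} <: IO
    io::T
end

# Base's generic fallbacks for printing strings and characters to a custom IO
# call this method one byte at a time. The four special characters are ASCII,
# and ASCII bytes never appear inside a multi-byte UTF-8 sequence, so checking
# individual bytes is safe.
function Base.write(e::EscapingIO, b::UInt8)
    b == UInt8('&') ? write(e.io, "&amp;")  :
    b == UInt8('<') ? write(e.io, "&lt;")   :
    b == UInt8('>') ? write(e.io, "&gt;")   :
    b == UInt8('"') ? write(e.io, "&quot;") :
                      write(e.io, b)
end

buf = IOBuffer()
print(EscapingIO(buf), "fft <see Julia docs>")
println(String(take!(buf)))
```

Under that assumption, a renderer could hand the wrapper to whatever prints the object, so the text is escaped as it is written and the `sprint` step is not needed. The cost is the extra type and a per-byte branch.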

hayd · 2015-03-14T17:18:38Z

Doesn't this PR already do that? or are you talking about the inline comment above? Not sure what you're referring to.

MikeInnes · 2015-03-14T18:26:04Z

Yeah woops, I should've replied to the inline comment. Specifically the thing about passing sprint(writemime, m, x) into printesc that you asked about.

hayd · 2015-03-17T00:53:03Z

@one-more-minute tbh I think this should be put back to htmlesc and then think about exporting/generalising. Not sure on a good API for C, html, latex, ... whatever else? (edit: regex escape would also be useful.)

The current/master print_escaped isn't great.

hayd · 2015-03-20T05:38:18Z

@one-more-minute I've rebased and changed back to htmlesc, in line with the latex pr: Thinking up a more general printesc/print_escape can be done in the future.

Would like to get this merged so I can PR some other changes to move towards being closer to commonmark. :)

MikeInnes · 2015-03-20T09:47:12Z

Ok, sure – I'll probably have a go at the buffer approach myself at some point.

MikeInnes · 2015-03-20T09:47:51Z

Ok, there are merge conflicts at the moment though, any chance you could do a quick rebase?

hayd · 2015-03-20T15:07:14Z

rebased/pushed (was conflict in test/markdown.jl)

hayd · 2015-03-20T16:59:18Z

I don't think the appveyor failure/osx timeout are to do with me!

MikeInnes reviewed Feb 5, 2015

hayd force-pushed the htmlescape branch from 2dbd24b to fe2f1a6 Compare February 11, 2015 19:43

MikeInnes reviewed Feb 25, 2015

hayd force-pushed the htmlescape branch from 2998db3 to 8b7285b Compare February 26, 2015 00:10

hayd force-pushed the htmlescape branch from 8b7285b to dd7f42a Compare February 26, 2015 06:04

MikeInnes reviewed Feb 26, 2015

hayd force-pushed the htmlescape branch from c15243e to 5662ace Compare February 27, 2015 02:01

hayd reviewed Mar 5, 2015

hayd mentioned this pull request Mar 12, 2015

Escape special characters JuliaAttic/Markdown.jl#17

Open

MikeInnes self-assigned this Mar 14, 2015

hayd force-pushed the htmlescape branch 2 times, most recently from 96d9846 to 2eac851 Compare March 15, 2015 07:30

hayd force-pushed the htmlescape branch 2 times, most recently from cb85324 to c2a0051 Compare March 20, 2015 06:49

enh escape html in markdown

30f2806

hayd force-pushed the htmlescape branch from c2a0051 to 30f2806 Compare March 20, 2015 15:05

MikeInnes added a commit that referenced this pull request Mar 20, 2015

Merge pull request #10061 from hayd/htmlescape

e13c9be

MikeInnes merged commit e13c9be into JuliaLang:master Mar 20, 2015
