Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htmlescape markdown #10061

Merged
merged 1 commit into from
Mar 20, 2015
Merged

htmlescape markdown #10061

merged 1 commit into from
Mar 20, 2015

Conversation

hayd
Copy link
Member

@hayd hayd commented Feb 3, 2015

Throwing this out there, IMO html escaping is required for rendering markdown in html.

cc @one-more-minute @MichaelHatherly

Note: This breaks the Ref test, so I'm not sure what I should do to fix it?

I need to read the Markdown spec a little more carefully, I'm not sure what the precise rules for escaping are. http://spec.commonmark.org/ I think I'm doing a little too much here (e.g. I wonder if links need different behaviour).

This was mentioned in #5239 and #9933.

@MikeInnes
Copy link
Member

This is the right idea, but it would be good to see a more efficient implementation. I think it might work nicely to have a printesc(io, s) function which checks each char before putting it into the stream.

If &.*; forms need to be handled specially (github does this at least), maybe it would be more robust to handle that case with an inline parser.

@hayd
Copy link
Member Author

hayd commented Feb 4, 2015

Does this mean there need to be two levels of escaping (another for &.*;) - when should these be escaped?

Agree about efficiency, this strategy was precisely what @ivarne suggested on the other issue.

@ivarne
Copy link
Sponsor Member

ivarne commented Feb 5, 2015

printesc is a somewhat generic name. Maybe we want a parameter to distinguish what type of escaping we want? (HTML could be default)

@MikeInnes
Copy link
Member

Since this is (at least for now) only an implementation detail of Markdown.jl I'm not too worried about naming or genericity. Having a uber-generic escaping system in Base may well be a reasonable goal but I'd rather it didn't block solving this immediate problem.

(Though genericity may also not be the right approach if markdown's escaping requirements are different from other formats, for example).

@hayd
Copy link
Member Author

hayd commented Feb 5, 2015

presumably htmlescape(io, s) would be fine too. Is the _htmlescape_chars approach reasonable (would the current impl be efficient enough if using a stream)?

@MikeInnes
Copy link
Member

Keeping the chars in a dict seems reasonable to me, the only thing is that it would be nice if the &, # and ; chars were added programmatically.

@@ -8,6 +8,20 @@ function withtag(f, io, tag)
print(io, "</$tag>")
end

const _htmlescape_chars = Dict('<'=>"&lt;", '>'=>"&gt;",
'"'=>"&quot;", '\''=>"&#39",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would avoid errors missing semicolon errors (the bane of a programmers life, of course).

@hayd
Copy link
Member Author

hayd commented Feb 11, 2015

@one-more-minute updated, is this more like what you were intending?

Note: the test still fails on the Ref test - not sure how to fix that.

@hayd
Copy link
Member Author

hayd commented Feb 24, 2015

Bump! Looking at this I don't see why this would affect the Ref test (which is purely on the md_str not html) :s

@MikeInnes
Copy link
Member

This basically looks good but the test is concerning. Maybe try rebasing? I can look into it more thoroughly but it might be a while before I have time.

@@ -5,7 +5,7 @@ include("rich.jl")
function withtag(f, io::IO, tag, attrs...)
print(io, "<$tag")
for (attr, value) in attrs
print(io, " $attr=\"$value\"")
print(io, " ", htmlescape(attr), "=\"", htmlescape(value), "\"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do attributes ever need to be escaped? As far as I know something special characters aren't valid attribute names anyway, and since the values are quoted they shouldn't need escaping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could themselves include quotes, so I think they do need escaping. ?

@hayd
Copy link
Member Author

hayd commented Feb 25, 2015

@one-more-minute the test is an legitimate failure due to Ref. I'm really not sure what I'm missing in that example ? :s (Edit: which is to say, I really don't see how this touches it as there is nothing html-y about that test.)

@MikeInnes
Copy link
Member

Maybe try seeing if there's a difference between the html output for md"Behaves like $(ref(fft))" and md"Behaves like fft (see Julia docs)" after this change? This PR shouldn't have changed the output but it looks like it must've done.

@hayd
Copy link
Member Author

hayd commented Feb 25, 2015

From the travis logs:

ERROR: LoadError: LoadError: test failed:
  (Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["Behaves like ",Reference(fft)])],Dict{Any,Any}())
== Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["Behaves like fft (see Julia docs)"])],Dict{Any,Any}()))

@hayd
Copy link
Member Author

hayd commented Feb 25, 2015

Tbh, I don't understand how that passed before! Looks like should be comparing plain(md...) ?

@MikeInnes
Copy link
Member

It passed before because the HTML output was the same for both (which is how equality is defined for MD objects). So this PR must be changing the HTML output somehow, I think.

@hayd
Copy link
Member Author

hayd commented Feb 25, 2015

Hmmm, it appears to also not work for plain either (when it did before):

julia> md"Behaves like $(ref(fft))"
  Behaves like Reference(fft)

julia> md"Behaves like fft (see Julia docs)"
  Behaves like fft (see Julia docs)

julia> html(md"Behaves like $(ref(fft))")
"<p>Behaves like Reference(fft)</p>\n"

julia> html(md"Behaves like fft (see Julia docs)")  # Note also brackets correctly escaped
"<p>Behaves like fft &#40;see Julia docs&#41;</p>\n"

and just the Reference itself (which should expand!):

julia> print(r)
Reference(fft)

@one-more-minute Any idea what am I missing?

@hayd
Copy link
Member Author

hayd commented Feb 26, 2015

@one-more-minute Ok, have a fix for this (should pass now). Sorry was easier than I thought. Thanks for your direction!


There is one weird definition I noticed:

plain(io::IO, x) = tohtml(io, x)

This leads to the following if writemime for test/html is defined (as in the new test):

julia> r = ref(fft)  # plain works, as this is the fallback for terminal (I think)
fft (see Julia docs)

julia> plain(r)  # this isn't plain!
"<a href=\"test\">fft &#40;see Julia docs&#41;</a>"

Of course, you actually use plaininline so it doesn't matter here (so the test etc. works), but this definition still seems odd (when would you want it?).

@MikeInnes
Copy link
Member

I'm not totally sure why that fallback is there, maybe it should be writemime.

I think it might be better to figure out what's breaking the test, rather than changing the test so it passes. I think what must be happening is that when tohtml falls back for text/plain it doesn't escape the resulting text.

const _htmlescape_chars = Dict('<'=>"&lt;", '>'=>"&gt;",
'"'=>"&quot;", # ' '=>"&nbsp;",
)
for ch in "'`!@\$\%()=+{}[]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add & here as well

@hayd
Copy link
Member Author

hayd commented Feb 27, 2015

@one-more-minute ok, I have a fix for these....

There was a minor issue (as you can see in the tests) the plain text writemime of Ref was not escaped. fixed this is, but I'm not sure what the syntax is to pipe through the output of tohtml through printesc (i.e. without sprint):

printesc(io, sprint(writemime, m, x))  # here m is MIME"text/plain"

How can I do that without the sprint. ?


Note: I was seeing the following strange behaviour but I think it's fixed (possible due to to my misuse/redefining Ref in the repl) :

julia> r = ref(fft)
Reference(fft)

julia> Markdown.bestmime(r)
MIME type text/plain

julia> Markdown.tohtml(STDOUT, Markdown.bestmime(r), r)
Reference(fft)

julia> writemime(STDOUT, Markdown.bestmime(r), r)
fft (see Julia docs)

The definition of tohtml here is to call writemime (!) so this behaviour is bizarre to me.

@@ -3,7 +3,7 @@ function tohtml(io::IO, m::MIME"text/html", x)
end

function tohtml(io::IO, m::MIME"text/plain", x)
writemime(io, m, x)
printesc(io, sprint(writemime, m, x))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@one-more-minute is there a way to do this line without sprint-ing (i.e. in one pass)?

Any other comments ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@one-more-minute sorry, I still don't understand the fix you're eluding to here/below!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@one-more-minute Another example where this kind of thing is useful is if we had newline characters in the MD (which the spec suggests), in html you want to print them as new lines BUT in the terminal as spaces (as html is rendered). Somehow:

remove_newlines(io, terminalinline(..., x))

@hayd
Copy link
Member Author

hayd commented Mar 12, 2015

another extension for this would be to escape latex for #10494

@hayd
Copy link
Member Author

hayd commented Mar 13, 2015

worth mentioning there is already a print_escaped method in Base (for escaping C and unicode escape sequences). Currently that only has one method:

print_escaped(io,s::AbstractString,esc::AbstractString) at string.jl:872

...

@MikeInnes
Copy link
Member

The way to do escaping without rendering to a string first would be to use the IO wrapper approach, like the one @ivarne suggested elsewhere. That does make things a bit more complicated, unfortunately, but I think it's the only way to do it completely efficiently.

@hayd
Copy link
Member Author

hayd commented Mar 14, 2015

Doesn't this PR already do that? or are you talking about the inline comment above? Not sure what you're referring to.

@MikeInnes
Copy link
Member

Yeah woops, I should've replied to the inline comment. Specifically the thing about passing sprint(writemime, m, x) into printesc that you asked about.

@MikeInnes MikeInnes self-assigned this Mar 14, 2015
@hayd hayd force-pushed the htmlescape branch 2 times, most recently from 96d9846 to 2eac851 Compare March 15, 2015 07:30
@hayd
Copy link
Member Author

hayd commented Mar 17, 2015

@one-more-minute tbh I think this should be put back to htmlesc and then think about exporting /generalising. Not sure on good API for C, html, latex, ... whatever else? (edit: regex escape would also be useful.)

The current/master print_escape isn't great.

@hayd
Copy link
Member Author

hayd commented Mar 20, 2015

@one-more-minute I've rebased and changed back to htmlesc, in line with the latex pr: Thinking up a more general printesc/print_escape can be done in the future.

Would like to get this merged so can PR some other changes to move towards being closer to commonmark. :)

@hayd hayd force-pushed the htmlescape branch 2 times, most recently from cb85324 to c2a0051 Compare March 20, 2015 06:49
@MikeInnes
Copy link
Member

Ok, sure – I'll probably have a go at the buffer approach myself at some point.

@MikeInnes
Copy link
Member

Ok, there are merge conflicts at the moment though, any chance you could do a quick rebase?

@hayd
Copy link
Member Author

hayd commented Mar 20, 2015

rebased/pushed (was conflict in test/markdown.jl)

@hayd
Copy link
Member Author

hayd commented Mar 20, 2015

I don't think the appveyor failure/osx timeout are to do with me!

MikeInnes added a commit that referenced this pull request Mar 20, 2015
@MikeInnes MikeInnes merged commit e13c9be into JuliaLang:master Mar 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants