Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

head > meta > content escaping issue #13838

Open
oliviertassinari opened this issue Oct 12, 2018 · 30 comments
Open

head > meta > content escaping issue #13838

oliviertassinari opened this issue Oct 12, 2018 · 30 comments

Comments

@oliviertassinari
Copy link
Contributor

oliviertassinari commented Oct 12, 2018

Do you want to request a feature or report a bug?

I'm guessing it's a bug.

What is the current behavior?

The following source code,

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08" />

, is being escaped once server side rendered:

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&amp;q=80&amp;mark=watermark%2Fcenter-v5.png&amp;markalign=center%2Cmiddle&amp;h=500&amp;w=500&amp;s=60ec785603e5f71fe944f76b4dacef08"/>

You can reproduce the behavior like this:

const React = require("react");
const ReactDOMServer = require("react-dom/server");
const http = require("http");

const doc = React.createElement("html", {
  children: [
    React.createElement("head", {
      children: React.createElement("meta", {
        property: "og:image",
        content:
          "https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08"
      })
    }),
    React.createElement("body", { children: "og:image" })
  ]
});

//create a server object:
http
  .createServer(function(req, res) {
    res.write("<!DOCTYPE html>" + ReactDOMServer.renderToStaticMarkup(doc)); //write a response to the client
    res.end(); //end the response
  })
  .listen(8080); //the server object listens on port 8080

editor: https://codesandbox.io/s/my299jk7qp
output : https://my299jk7qp.sse.codesandbox.io/

What is the expected behavior?

I would expect the content not being escaped. It's related to vercel/next.js#2006 (comment).
I'm using the og:image meta element so my pages can have nice previews within Facebook :).

capture d ecran 2018-10-12 a 14 15 26

Which versions of React, and which browser / OS are affected by this issue? Did this work in previous versions of React?
16.5.2

@gaearon
Copy link
Collaborator

gaearon commented Nov 1, 2018

Has this ever worked as you intend? Can you send a fix?

@oliviertassinari
Copy link
Contributor Author

We are solving the problem this way:

import Entities from 'html-entities/lib/html5-entities'

const entities = new Entities()
const contentRegExp = /content="([^"]+)"/g
const handleContent = (match, content) => {
  return `content="${entities.decode(content)}"`
}

html = html.replace(contentRegExp, handleContent)

We spend ~1ms per request in the path. It's not too bad. I can give it a look at some point.

@oliviertassinari
Copy link
Contributor Author

oliviertassinari commented Jan 14, 2019

I have found this related issue: #6873. Digging into the implementation, the behavior comes from

return attributeName + '=' + quoteAttributeValueForBrowser(value);

⬇️
return '"' + escapeTextForBrowser(value) + '"';

⬇️


Now, all the escaping tests I could find are covering the children use case:

const response = ReactDOMServer.renderToString(<span>{'&'}</span>);
expect(response).toMatch('<span data-reactroot="">&amp;</span>');

I have limited knowledge of web escaping related security issues.
I don't see any harm potential with:

 const response = ReactDOMServer.renderToString(<span data-src={'&'}></span>); 
 expect(response).toMatch('<span data-reactroot="" data-src="&"></span>'); 

@andreubotella
Copy link

andreubotella commented Mar 6, 2019

I have the same problem in the content of <style> elements:

const React = require("react");
const ReactDOMServer = require("react-dom/server");

console.log(ReactDOMServer.renderToStaticMarkup(
  <html>
    <head>
      <link
        href="https://fonts.googleapis.com/css?family=Source+Sans+Pro"
        rel="stylesheet"
      />
      <style>{`
        html {
          font-family: "Source Sans Pro", sans-serif;
        }
      `}</style>
    </head>
    <body>
      <p>Test.</p>
    </body>
  </html>
));

This outputs:

<html><head><link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet"/><style>
        html {
          font-family: &quot;Source Sans Pro&quot;, sans-serif;
        }
      </style></head><body><p>Test.</p></body></html>

By the parsing rules in the HTML spec (I'm consulting WHATWG here), the contents of elements style, xmp and iframe (as well as noscript, noframes and noembed when they're not being rendered) are parsed with the RAWTEXT tokenizer state, which treats everything as plaintext until it finds a matching closing tag.

Escaping the contents of style elements is, however, valid (in fact, mandatory for angled brackets) in the XML syntax of HTML; and indeed, adding an xmlns="http://www.w3.org/1999/xhtml" attribute to the <html> element results in valid XML. But if the intention of ReactDOMServer is indeed to render XML syntax, that should be explicitly noted in the documentation, because there are a number of tools (such as Next.js) which serve the output of these functions with content-type text/html.

@oliviertassinari
Copy link
Contributor Author

oliviertassinari commented Mar 6, 2019

@andreubotella This is a different problem, you should use dangerouslySetInnerHTML. Can an admin mark the comments as "resolved"?

@stale
Copy link

stale bot commented Jan 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution.

@stale stale bot added the Resolution: Stale Automatically closed due to inactivity label Jan 10, 2020
@stale
Copy link

stale bot commented Jan 17, 2020

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

@stale stale bot closed this as completed Jan 17, 2020
@gaearon gaearon reopened this Apr 1, 2020
@stale stale bot removed the Resolution: Stale Automatically closed due to inactivity label Apr 1, 2020
@jbraithwaite
Copy link

This is not a bug in React. Using an entity reference for & (e.g. &amp;) is the correct behavior for xhtml documents:

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a`` element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as `http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user`.

In the HTML spec you do not need to use a character reference for & as long as what follows it is not a string that forms a named character reference.

The example they give is:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

Personally, I feel like React made the right call with escaping & since that works in both XHTML and HTML5.

@equinusocio
Copy link

equinusocio commented Aug 18, 2020

In meta tags escaped paths don't work... otherwise, this bug would not have be opened.

@jbraithwaite
Copy link

jbraithwaite commented Aug 18, 2020

This is the change needed to get the behavior you expect:

Replace

markup = createMarkupForProperty(propKey, propValue);

with an escape hatch:

if (tagVerbatim === 'meta' && propKey === 'content') {
  markup = 'content="' + propValue + '"';
} else {
  markup = createMarkupForProperty(propKey, propValue);
}

This would explicitly exempt the meta tag's content attribute from being properly escaped which wouldn't help @oliviertassinari's issue of wanting <span data-src={'&'}></span>.

A more generic solution would involve having something like dangerouslySetAttributes

<span
  dangerouslySetAttributes={{__attributes: [{name: 'data-src', value: '&'}]}}
/>

This could easily lead to parsing errors and unexpected results if any value after the & is a named character reference e.g. &copy (without the ;)

Again, the issue was with the HTML parser FB was using for the Sharing Debugger, not React. It is properly parsing the escaped paths now.

singleportrait added a commit to singleportrait/lucia-kempkes-portfolio that referenced this issue Aug 23, 2020
parameters in the meta file

It's a known React bug, but the server-side rendering can't handle query
parameters in the meta head file; it turns `&`s into `&amp;`s, etc.

To bypass that, we've added a custom share image field to the content
model, and require that to be under 1MB so we don't have to handle it
here.

You *can* run an Express server using Next JS that will handle these
HTML entities and decode them, but Vercel (the host) won't allow that,
and also provides some custom optimization that other hosts wouldn't,
out-of-the-box. So, decided to not go that route.

Some open github issues, for reference:
- vercel/next.js#2006
- facebook/react#13838
- An example Express server fix: https://gist.github.com/stefl/1f8c246dd7ca9cb332ae41f68e80088d
@Macil
Copy link

Macil commented Mar 24, 2021

@equinusocio Anything reading the meta tags from the HTML should handle decoding the HTML encoded characters that React produces here. It's a bug with anything consuming the HTML if it doesn't decode the &amp;s correctly. If you're trying to test your work and want to read HTML-decoded value, then in dev tools element inspector, pick the meta tag element, and in the console run $0.content. That will return the proper value because the browser handles HTML decoding properly as anything parsing the HTML should.

@YoannBuzenet
Copy link

Hello, I'm having this problem and can't find any fix. I'm on React 17.0.1. Would anyone have a hint ?

@notsidney
Copy link

I’ve spent the last little while trying to figure out this issue since the encoded URLs cause LinkedIn to not see Open Graph images.

I’ve come up with a workaround for our Gatsby site that uses ReactDOMServer.renderToStaticMarkup, replacing &amp; to & in the string, then rendering using dangerouslySetInnerHTML:

const headHtml = ReactDOMServer.renderToStaticMarkup(headComponents).replace(
  /&amp;/g,
  '&'
)

return (
  <html>
    <head dangerouslySetInnerHTML={{ __html: headHtml }} />

...

And here’s the full html.js for a Gatsby project: https://github.com/notsidney/gatsby-meta-encoded-url-workaround/blob/main/src/html.js#L6-L25

@gaearon
Copy link
Collaborator

gaearon commented Sep 6, 2021

Do I understand correctly that the issue is primarily due to the content scrapers that don’t consider encoded characters? So the React output is technically correct but some scrapers are too naïve? Or is there a reason why React behavior is technically incorrect?

@ihmpavel
Copy link

ihmpavel commented Sep 6, 2021

Yes and no. For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users). Same behavior is for Next.JS.

Inspired by vercel/next.js#2006 (comment)
<meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&amp;two=2' />.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do. It is a bigger problem, but quite unique.

Unfortunately there is no easy solution (dangerouslySetInnerHTML, decoding escaped HTML... - it does not suit for every problem).

@gaearon
Copy link
Collaborator

gaearon commented Sep 6, 2021

Yes and no

"No" to which part?

For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users).

I understand that particular services may not work. What I'm asking is — technically would you agree that the problem is on their end, or is the problem unambiguously on our end? I don't mean to turn this into "not our problem, bye" kind of argument, but I think we need to get clarity on the ideal situation before we can move forward with any plan here. So is React producing technically correct output (and resoc's parser is wrong) or is React producing technically incorrect output? I understand this question may not be important to your use case (it doesn't work either way) but it is important to me.

Same behavior is for Next.JS.

I don't know what you mean by this. Next.JS is using React so it's expected that it would have the same behavior. You mention "same behavior" after resoc.io which makes it sound like you're saying Next.JS is also incorrectly parsing something, but Next.JS is producing, not consuming, and it's producing via React, so I'm not sure if there's any extra meaning I'm missing here. I'd expect Next.JS to not be relevant in this issue because the issue is about React behavior itself, and discussing React alone should be enough to figure out the path forward here. But maybe I missed something!

Inspired by vercel/next.js#2006 (comment) <meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&amp;two=2' />.

OK so maybe this is what you mean by "for Next.JS". I think you're saying: "when the URL is encoded, the request received by my API built with Next.JS contains incorrect parameter". However, this assertion is missing an important detail: what exactly is sending a request to your Next.JS API? Is this Facebook? Twitter? LinkedIn? resoc.io? This is important — I need to understand which case exactly you are trying to solve, if there is a concrete one.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do.

It is unclear which libraries you're referring to here. Next.JS? I believe Next.JS's behavior is correct here — if the request itself has an encoded string, Next.JS isn't supposed to unencode it. I believe unencoding should've been done by whatever thing that parsed your meta tag. This is why I'm asking what is the exact service you're trying to make this work with. And whether the same problem exists for other services now. E.g. if FB and Twitter treat this correctly, I think this could be a strong enough argument that you should report the problem to the faulty service instead, and have them fix it.

@depoulo
Copy link

depoulo commented Sep 7, 2021

In my view it's just naïve scrapers, some if which are run by Facebook, see #6873 (comment)

@devanshj
Copy link

devanshj commented Nov 3, 2021

fwiw afaict this makes you unable to write inline scripts in NextJS.

@phawk
Copy link

phawk commented Feb 23, 2022

Whilst React's approach might be technically correct, it doesn't work in the real world, because we have naive scrapers, from massive organisations like LinkedIn, Slack and WhatsApp. So you can choose to be technically correct and pedantic or to have your code work, in the real world, right now. I for one choose the latter.

I'm trying to get open graph images with query params to work, now I need to look for a hacky workaround because I can't go and force these massive tech organisations to "fix" their code.

@Macil
Copy link

Macil commented Feb 23, 2022

I'm wondering if most of the people who believe this is an issue for themselves are mistaken. I first found this issue because I was trying to figure out why Twitter wasn't showing an image from one of my <meta> tags. I viewed the HTML my server was rendering with React, copied the URL from the content="..." part of my meta tag, and found that it didn't work in my browser, so I assumed the reason that didn't work was the same reason Twitter wasn't loading my image. However, the reason that copied URL didn't work in my browser was just because I copied the raw HTML instead of the HTML-decoded string, and it was not the reason that Twitter wasn't loading my image. The actual issue was that I wasn't using the proper set of <meta> tags that Twitter respected. (There were no entries in my request logs from Twitter. If the issue had been that they weren't properly doing HTML decoding, then I would have seen entries in the request logs for URLs containing &amp; where there should have been &.)

I also wonder if some people are improperly HTML-encoding some strings before giving them to React and then thinking the double-encoding is caused by React doing HTML-encoding. People of both of these situations are present in the related thread #6873, so it seems likely that at least a decent portion of people in this thread have the same confusions.

Before suggesting putting workarounds in React for other services' parsers, we should try to be sure the problem is not one of these other dev-side confusions, we should figure out a list of services we're sure have the issue, we should try to report the issue to them first, and only then move onto figuring out workarounds if the list of services with problems is still large.


Several people have suggested that LinkedIn is unable to handle properly HTML-encoded meta tags, but from my tests they seem to handle things fine. I created this test HTML page:

<!DOCTYPE html>
<html>
  <head>
    <title>test page</title>
    <meta property="og:title" content="Title of the article" />
    <meta property="og:image" content="/1234567.png?a=b&amp;c=e" />
    <meta
      property="og:description"
      content="Description that will show in the preview"
    />
  </head>
  <body>
    this is a test page
  </body>
</html>

The HTML for the meta og:image tag has "/1234567.png?a=b&amp;c=e", which should decode to "/1234567.png?a=b&c=e".

I made a directory, put that file as "index.html" in it, put a random image named "1234567.png" in the directory too, I ran npx http-server . to host that directory through a local http server, and then I ran ngrok http 8080 to get a public URL that leads to that HTTP server. I then put the URL into https://www.linkedin.com/post-inspector/.

My local HTTP server's logs immediately showed the following, confirming that it correctly decoded:

[2022-02-23T23:33:31.661Z]  "GET /" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"
[2022-02-23T23:33:32.181Z]  "GET /1234567.png?a=b&c=e" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"

Additionally, the LinkedIn Post Inspector page correctly showed the detected image URL as https://a126-omitted-3fb7.ngrok.io/1234567.png?a=b&c=e.

It seems like LinkedIn is handling things correctly and working with the standard HTML-encoding behavior React is following, and that no change or workaround in React is necessary to work here with them. If there are other parts of LinkedIn that do have an issue, then the issue could be demonstrated with a test like this.

@apecollector
Copy link

Same issue, using og:image where the url uses query parameters, twitter sees these and sends them to my server which doesn't recognize &amp;query=value instead of &query=value.

@talentedandrew
Copy link

@apecollector I had the similar experience with renderToStaticMarkup method, which escapes html tags, new line etc. I used unescape from lodash to fix it.

_.unescape('&amp;query=value');

outputs

&query=value

@sc0ttdav3y
Copy link

sc0ttdav3y commented Sep 9, 2022

Hi all, I know this is a 4 year old issue and probably most of you have probably moved on, but I'm new to it and feel I must be missing something obvious.

I suspect we're all adding content security policies to our React apps by now, and those policies have a content value often with single quotes in them. I'm running Next.JS with static site generation, and their advice is to add it into /pages/_document.tsx. But this happens:

<meta httpEquiv="Content-Security-Policy" content="default-src 'self'" />

becomes

<meta httpEquiv="Content-Security-Policy" content="default-src &amp;self&amp;" />

What's the recommended way to do this in 2022?

@devilportez
Copy link

Hi, I have the same issue as @sc0ttdav3y has, but in my case single quotes are escaping with &#x27; (Next.JS 12.3.0, ssg).

@Macil
Copy link

Macil commented Sep 14, 2022

(I just tried the same thing with NextJS and the exact output I got is <meta http-equiv="Content-Security-Policy" content="default-src &#x27;self&#x27;"/>, consistent with #13838 (comment), so I'm going to assume the different value in #13838 (comment) was a copy-paste error of some kind.)

@sc0ttdav3y @devilportez It should be fine that the value is html-encoded like that. The browser will decode html entities when reading the html and interpret the value of the meta tag the same way as "default-src 'self'". You should see the CSP policy be in effect: if you make the page use something denied by the pollicy (like inline styles and scripts, which NextJS uses by default) then an error will appear in the console.

@BrodaNoel
Copy link

BrodaNoel commented Mar 14, 2023

Still happening in [email protected] and [email protected].

Kind of killing my SEO because I can't use:

<meta property="og:image" content="url-with-&" />

@aaronbarker
Copy link

I ran into this escaping issue and tried a hack from another issue I had also ran into. It seemed to have fixed the escaping for me. You get a few unnecessary <style></style> elements injected, but may be a better option for some than parsing the final content. YMMV

<style
  dangerouslySetInnerHTML={{
    __html: `</style>
      ${yourCode}
    <style>`,
  }}
></style>

@jimrandomh
Copy link

Wound up here investigating why a link-preview was missing its preview image on Twitter. The escaping turned out to be a red herring; Twitter's link-preview bot handles escaped URLs just fine. The problem was that it wasn't using our own mirror of the image, it was using a user-submitted image URL on pbs.twimg.com, which apparently blocks all bot traffic (including from Twitter's own link-preview bot).

@erickreutz
Copy link

Running into this issue as well using Next.js and generateMetadata <meta description="we're" /> becomes <meta description="we&#x27;re" />

@GabenGar
Copy link

GabenGar commented Oct 1, 2024

Issue with what? The consumer of this meta tag not parsing escaped HTML values correctly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests