Translation tooltip images have extra encoding in paths #81

turnbullerin · 2022-07-21T21:33:38Z

Hi!

I'm setting up ERDDAP behind an nginx proxy and using the sub module to rewrite the HTML content to have the proper URLs for our host (mostly because I'm planning a federated install at some point in the future and will only have one host name, so I'm doing something like http://publichost/1/erddap --> http://erddap1:8080/erddap, http://publichost/2/erddap --> http://erddap2:8080/erddap).

One issue I've noticed with this is that some links are getting very encoded to the point where it's hard to make good rewrite rules (notably where there's a tooltip and an image) - this seems to be the doing of XML.encodeAsHTMLAttribute() which is converting everything that isn't a digit or letter into an entity. The result is a URL like http://erddap1.org:8080/erddap/index.html is actually http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;png which makes using sub more challenging.

It would be nice for these URLs to be consistent with the rest of the URLs on the page (which are not encoded in this fashion). I think it is safe to include :/. in the characters that encodeAsHTMLAttribute() would not encode at least, and generally only &, quotation marks, and those characters not supported in your character encoding need encoding (so codepoints > 255)

BobSimons · 2022-07-22T14:12:09Z

I admit that some of the text (e.g., urls) in HTML attributes are excessively encoded.

However, I used to have a method (I think it was called minimallyEncodeAsHTMLAttribute()) which encoded a few more characters than what you suggested to minimally encode text that was stored as HTML attributes. However, that is insufficient for security reasons, notably the danger or cross-site scripting (XSS). See
https://wonko.com/post/html-escaping and
https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet#Output_Encoding_Rules_Summary [that's now out-of-date, perhaps you can find the newer reference].

You can say, "yeah, but surely you don't need to encode that aggressively". But I think the problem was that ERDDAP was being flagged by security scans because of the possibility of XSS vulnerabilities. The mere presence of certain characters (characters you wouldn't think to be trouble) was flagged.

So the situation is not a simple as one might expect and the safe thing was to aggressively encode the attributes (which has not caused any problems for ~8 years, until now). I suspect the happy medium is somewhere between what minimallyEncodeAsHTMLAttribute did and the current system, but it is exceedingly hard to know what that set of characters that need to be encoded (to avoid trouble with security scans) is.

So, I will try to less aggressively encode those particular urls so the problem goes away for you.

You say "it's hard to make good rewrite rules", but encodeAsHTMLAttributes just has one rule: if the character isn't a letter or a digit, then encode it as &#xhh; (where hh is the hexadecimal version of the character number). That is a legitimate encoding of the HTML attribute. So it seems like a solution for you is simply to decode those HTML attributes via the inverse of that rule. This is a subset of the standard way of encoding HTML attributes. You'll always have to deal with the possibility of some characters being encoded this way. I don't know what tools you have at your disposal in NGINX, but hopefully you can write the inverse of that rule if a built in function doesn't already exist.

I admit I don't quite understand the problem you face. After all, when the browser reads the HTML, one of the first things it does is decode the attributes. It's not like a browser is ever going to redirect anyone to http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;html. User's get redirected to the decoded URL. So the problem is that your system (NGINX?) isn't decoding the attribute, which is something it needs to do in any case. The current encoding may be excessive, but it isn't illegal or incorrect.

BobSimons · 2022-07-22T14:32:28Z

You said "it's hard to make good rewrite rules", but you didn't say what tools you have at your disposal. Since you say it's hard,
I'm guessing it isn't a programming language and uses regular expressions (regex's) instead.
I'm guessing you're trying to match the stated URL and replace it with something else, so the real problem is finding the first part of the URL.
If I'm guessing correctly, can't you just use a capture group
(variant1|variant2|variant3) //but with any number of variants
to match the different forms of the first part of the URL?, e.g.,
(http://my.domain:8080/|http&x3a;&x2f;&x2f;my&x2e;domain&x3a;8080&x2f;)
?
I could easily be wrong about your situation. If so, please tell me more of the details of what you need to do and what tools you have (e.g., regex's).

BobSimons · 2022-07-22T17:38:59Z

I found the updated link for the OWASP Cheatsheets Book:
See https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf
page 188, section 25.4, says
Encoding Type: HTML Attribute Encoding
Encoding Mechanism:
Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH; format,
including spaces. (HH = Hex Value)"

So my current method is following the OWASP advice exactly.
I think the security scan report pointed me to this originally, as I wouldn't have found it on my own.
So I am disinclined to change the encodeAsHTMLAttribute() method in my code, as it is likely that this will just generate warnings/errors on the security scan report.

It would be great if we could solve the problem you face via a change in your NGINX configuration.

turnbullerin changed the title ~~Translation tooltip image breaks with proxy~~ Translation tooltip images have extra encoding in paths Jul 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Translation tooltip images have extra encoding in paths #81

Translation tooltip images have extra encoding in paths #81

turnbullerin commented Jul 21, 2022 •

edited

Loading

BobSimons commented Jul 22, 2022

BobSimons commented Jul 22, 2022

BobSimons commented Jul 22, 2022

Translation tooltip images have extra encoding in paths #81

Translation tooltip images have extra encoding in paths #81

Comments

turnbullerin commented Jul 21, 2022 • edited Loading

BobSimons commented Jul 22, 2022

BobSimons commented Jul 22, 2022

BobSimons commented Jul 22, 2022

turnbullerin commented Jul 21, 2022 •

edited

Loading