Email addresses in HTML content are removed when sanitizing text coming from a plaintext email #126

Closed
istrasoft opened this issue Aug 23, 2017 · 7 comments


istrasoft commented Aug 23, 2017

When the string to sanitize comes from a plaintext email, lines like these appear in the original content:

blah blah

From: Mark <mailto:[email protected]>
Sent: Wednesday, August 16, 2017 19:47
To: John <[email protected]>
Subject: Re: Document Test

Hello John

If the email was an HTML email, the < and > around "<[email protected]>" are already escaped as &lt; and &gt;, but if the email was plaintext, they are not.

In this specific case, the part <[email protected]> is considered to be an invalid HTML tag and is removed, along with all the following content from that point.

If option "Keep child nodes of removed elements" is chosen, then only these email tags are lost.

It would be great if, after testing a tag against the whitelist, an additional test were made to match it against these two standard and safe forms.
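Outside the sanitizer, matching these two forms could be sketched as a preprocessing step. This is only an illustration: the class name `EmailBracketEscaper`, the method `Escape`, and the deliberately loose pattern are assumptions, not part of HtmlSanitizer, and the example.com addresses in the usage note are placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HtmlSanitizer): escapes angle brackets
// around <user@host> and <mailto:user@host> so they survive as text.
public static class EmailBracketEscaper
{
    // Loose pattern: no whitespace, angle brackets or extra '@' inside the brackets.
    // It does not validate the address itself.
    private static readonly Regex BracketedAddress = new Regex(
        @"<((?:mailto:)?[^\s<>@]+@[^\s<>@]+)>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string Escape(string input)
    {
        // $1 is the address (with an optional mailto: prefix) without the brackets.
        return BracketedAddress.Replace(input, "&lt;$1&gt;");
    }
}
```

Called as `EmailBracketEscaper.Escape("To: John <john@example.com>")`, the bracketed address becomes `&lt;john@example.com&gt;`, while ordinary tags such as `<b>` are left alone because they contain no `@`.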

Owner

mganss commented Aug 23, 2017

This is the same problem as #91. See there for a workaround.

Author

istrasoft commented Aug 24, 2017

Thanks @mganss !
However, this specific case is an almost RFC-level standard occurrence, unlike "custom" HTML tags, so I thought maybe the project would support it out of the box rather than through a workaround :)

Owner

mganss commented Aug 24, 2017

I'd like HtmlSanitizer to "do one thing well" and that's sanitize HTML, so adding this would be outside of this scope. If you have something that's not HTML, you'll need to do preprocessing.
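For content known to be plaintext, one possible form of such preprocessing is to HTML-encode the whole body before sanitizing. The sketch below is an assumption about how that might look, not the workaround from #91; `PlainTextPreprocessor` and `ToHtml` are hypothetical names, while `WebUtility.HtmlEncode` is the standard .NET method.

```csharp
using System;
using System.Net;

// Hypothetical helper: turns a plaintext email body into text that is safe
// to treat as HTML, so <user@host> is no longer parsed as a tag.
public static class PlainTextPreprocessor
{
    public static string ToHtml(string plainText)
    {
        // HtmlEncode replaces <, >, & and quotes with character references.
        return WebUtility.HtmlEncode(plainText ?? string.Empty);
    }
}
```

The encoded string can then be passed to `sanitizer.Sanitize(...)` as usual. This only makes sense when the input is known to be plaintext; encoding real HTML this way would turn its markup into visible text.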

Author

istrasoft commented Aug 24, 2017

Indeed, that makes sense. Maybe your gist could be included in the distribution and accessed through an additional call or option/flag. Thanks for the workaround and quick replies :)

Owner

mganss commented Aug 24, 2017

I have added it to the Examples wiki page.

@istrasoft
Author

Thanks a lot @mganss !

@sunitana

Hi, just FYI, here is how I am handling this issue.

I created a method that checks whether the tag about to be removed is in email format.
    sanitizer.RemovingTag += (sender, evt) =>
    {
        // isValid and invalidTags are declared outside this handler (not shown).
        if (IsValidMailAddress(evt.Tag))
        {
            // The tag won't be removed if it is in an email format.
            isValid = true;
            evt.Cancel = true;
        }
        else if (!invalidTags.ContainsKey(evt.Tag.TagName))
        {
            //invalidTags.Add(evt.Tag.TagName, evt.Reason.ToString());
            isValid = false;
        }
    };

    public static bool IsValidMailAddress(AngleSharp.Dom.IElement emailAddress)
    {
        try
        {
            string name = emailAddress.NodeName;
            if (name.ToLower().StartsWith("mailto:"))
            {
                // Strip the "mailto:" prefix before validating.
                name = name.Substring("mailto:".Length);
            }
            // The MailAddress constructor throws if the string is not a valid address.
            new System.Net.Mail.MailAddress(name);
            return true;
        }
        catch (Exception)
        {
            return false;
        }
    }
