Preview text / snippet #1001

Closed
SmbatChilingaryan opened this issue Mar 22, 2020 · 18 comments
Labels
enhancement New feature or request

@SmbatChilingaryan

SmbatChilingaryan commented Mar 22, 2020

Issue related to preview text.
I need to receive messages from Gmail, and everything works except the preview text. PreviewText sometimes contains HTML tags, sometimes links, and sometimes odd characters, which is not correct. I compared the PreviewText result with the snippet returned by the Google API, and the Google API snippet is exactly what I see in my mail list.
Can you please explain whether it is normal for PreviewText to contain HTML tags, links, and odd characters?
How can I get the snippet?

List<Message> messages= new List<Message>();
using (var imap = new ImapClient())
{
    imap.Connect("imap.gmail.com", 993, SecureSocketOptions.SslOnConnect);
    imap.Authenticate(username, Password);
    var gmailFolder = imap.GetFolder(folderName);
    gmailFolder.Open(FolderAccess.ReadWrite);
    var msg = gmailFolder.Fetch(0, -1, MessageSummaryItems.All |
        MessageSummaryItems.References |
        MessageSummaryItems.UniqueId |
        MessageSummaryItems.PreviewText |
        MessageSummaryItems.GMailMessageId |
        MessageSummaryItems.GMailThreadId);

    foreach (var item in msg)
    {
        var message = new Message
        {
            snippet = item.PreviewText
        };
        messages.Add(message);
    }

    imap.Disconnect(true);
}

Actual result
"snippet": " <html xmlns="http://www.w3.org/1999/xhtml\"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equ",

"snippet": "Hi Smbat,\r\n\r\nArman Karapetyan invited you to like Aleeque.\r\n\r\nIf you like Aleeque follow the link below:\r\nhttps://www.facebook.com/n/?pag2F&id=1045999611810&ori=page_invite&ext=1587243ash=AeT4-qrU_ustQegh&ref=1584649",

Expected result
The same string that appears in the Gmail message list.

Thank you

@jstedfast
Owner

The IMAP protocol does not contain a command that will give you a preview snippet of the message.

What MailKit does is to implement the suggestion I gave to some developers who kept asking me "how can I get some preview text to display to the user in a message list like Outlook or GMail does?"

What I told them to do was to first FETCH the BODYSTRUCTURE of the message, locate the main message body, and then to issue a second FETCH request for BODY.PEEK[${part-specifier}]<0.256> (which grabs the first 256 bytes of the message body).

After getting tired of answering this question, I decided to implement this as a feature of MailKit and that is what you are seeing.

So yes, unfortunately, that means that sometimes you see raw HTML tags.

If you've got a better idea for obtaining this, I would love to hear it because I agree that this solution isn't ideal. I just don't know of a better way.
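
For reference, the two-step approach described above could look roughly like this using MailKit's public API. This is only a sketch, not MailKit's internal implementation: `RawPreview` and `FetchBodyPrefixes` are made-up names, the folder is assumed to be open already, and the bytes returned are the raw part content, so they may still be base64 or quoted-printable encoded and may end in the middle of an HTML tag.

```csharp
using System.Collections.Generic;
using System.IO;
using MailKit;

static class RawPreview
{
    // Hypothetical helper: returns up to `count` raw bytes of each message's main text part.
    public static Dictionary<UniqueId, byte[]> FetchBodyPrefixes (IMailFolder folder, int count = 256)
    {
        var result = new Dictionary<UniqueId, byte[]> ();

        // Step 1: FETCH the UID and BODYSTRUCTURE of every message in the folder.
        var summaries = folder.Fetch (0, -1, MessageSummaryItems.UniqueId | MessageSummaryItems.BodyStructure);

        foreach (var summary in summaries) {
            // Locate the main body: prefer text/plain, fall back to text/html.
            BodyPart part = summary.TextBody ?? summary.HtmlBody;

            if (part == null)
                continue;

            // Step 2: partial fetch of only the first `count` bytes of that part.
            using (var stream = folder.GetStream (summary.UniqueId, part, 0, count))
            using (var memory = new MemoryStream ()) {
                stream.CopyTo (memory);
                result[summary.UniqueId] = memory.ToArray ();
            }
        }

        return result;
    }
}
```

Whatever consumes these bytes still has to decode the transfer encoding and charset before any text can be shown, which is part of why the result is not always clean.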

@jstedfast jstedfast added the question A question about how to do something label Mar 23, 2020
@jstedfast
Owner

Out of curiosity, I just opened up GMail and looked through a few messages to see how much text MailKit would have to download in order to get the preview text that GMail displays for them. In one sample, MailKit would need to download 12K (message body is HTML) in order to extract the text that I could see in GMail's message list view. That's just impractical.

Looking at my iPhone's Mail.app, it seems to do the same. Holy cow.

Well, I guess the solution for you, then, is to download the entire message, extract the body text from it, and display that as the preview text.

@SmbatChilingaryan
Author

Thank you for your quick and valuable response.
I will try to strip the HTML tags from PreviewText using a regex.
Thank you.
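
A regex-based cleanup along those lines might look like the sketch below. `PreviewCleaner` and `Clean` are hypothetical names, not part of MailKit or MimeKit. It drops `script`, `style` and `head` blocks, strips the remaining tags (including a tag cut off at the end of the truncated data), decodes entities, and collapses whitespace. Regexes cannot parse HTML robustly, and a literal `<` in plain text will be treated as the start of a tag, so treat this as a stopgap.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class PreviewCleaner
{
    // Blocks whose inner text should never be shown; "|$" also covers a block cut off by truncation.
    static readonly Regex Hidden = new Regex (@"<(script|style|head)\b.*?(</\1\s*>|$)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag, including an unterminated one at the very end of the input.
    static readonly Regex Tags = new Regex (@"<[^>]*(>|$)");

    static readonly Regex Spaces = new Regex (@"\s+");

    public static string Clean (string html)
    {
        if (html == null)
            throw new ArgumentNullException (nameof (html));

        var text = Hidden.Replace (html, " ");
        text = Tags.Replace (text, " ");
        text = WebUtility.HtmlDecode (text); // &amp; -> &, and so on

        return Spaces.Replace (text, " ").Trim ();
    }
}
```

Note that when the first 256 bytes consist only of markup, as in the first "actual result" above, nothing is left after cleaning, so more of the body has to be downloaded to get a useful preview.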

@jstedfast
Owner

jstedfast commented Mar 24, 2020

FWIW, I'm working on fixing MimeKit's HtmlTokenizer to make this possible w/ truncated data (which is what you get if you don't grab the full stream).

I might try to make a public class for this in MimeKit or MailKit, just not sure how to expose the API yet.

@SmbatChilingaryan
Author

SmbatChilingaryan commented Mar 24, 2020

You are cool man.
Just FYI, PreviewText can contain HTML tags, redundant symbols, and links.

@jstedfast
Owner

jstedfast commented Mar 25, 2020

Step 1 has been to add the hooks I need into the HtmlTokenizer which I have now: jstedfast/HtmlKit@ec07b51

Then I noticed that the HTML entity decoder was terribly slow, so I optimized that: jstedfast/HtmlKit@10d010c

And based on my research so far... it seems that GMail generates its preview text based on the first 16K of the message body text.

Then I have a few classes that I'm working on that look like this:

using System;
using System.IO;
using System.Text;

using MimeKit.Utils;

namespace MimeKit.Text {
	/// <summary>
	/// An abstract class for generating a text preview of a message.
	/// </summary>
	/// <remarks>
	/// An abstract class for generating a text preview of a message.
	/// </remarks>
	public abstract class TextPreviewer
	{
		int maximumPreviewLength;

		/// <summary>
		/// Initializes a new instance of the <see cref="TextPreviewer"/> class.
		/// </summary>
		/// <remarks>
		/// Initializes a new instance of the <see cref="TextPreviewer"/> class.
		/// </remarks>
		protected TextPreviewer ()
		{
			maximumPreviewLength = 100;
		}

		/// <summary>
		/// Get the input format.
		/// </summary>
		/// <remarks>
		/// Gets the input format.
		/// </remarks>
		/// <value>The input format.</value>
		public abstract TextFormat InputFormat {
			get;
		}

		/// <summary>
		/// Get or set the maximum text preview length.
		/// </summary>
		/// <remarks>
		/// Gets or sets the maximum text preview length.
		/// </remarks>
		/// <value>The maximum text preview length.</value>
		/// <exception cref="System.ArgumentOutOfRangeException">
		/// <paramref name="value"/> is less than <c>1</c> or greater than <c>1024</c>.
		/// </exception>
		public int MaximumPreviewLength {
			get { return maximumPreviewLength; }
			set {
				if (value < 1 || value > 1024)
					throw new ArgumentOutOfRangeException (nameof (value));

				maximumPreviewLength = value;
			}
		}

		/// <summary>
		/// Get a text preview of the text part.
		/// </summary>
		/// <remarks>
		/// Gets a text preview of the text part.
		/// </remarks>
		/// <param name="body">The text part.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <paramref name="body"/> is <c>null</c>.
		/// </exception>
		public static string GetPreviewText (TextPart body)
		{
			if (body == null)
				throw new ArgumentNullException (nameof (body));

			if (body.Content == null)
				return string.Empty;

			var encoding = body.ContentType.CharsetEncoding;

			if (encoding == null) {
				using (var content = body.Content.Open ()) {
					if (!CharsetUtils.TryGetBomEncoding (content, out encoding))
						encoding = CharsetUtils.UTF8;
				}
			}

			using (var content = body.Content.Open ()) {
				TextPreviewer previewer;

				if (body.IsHtml)
					previewer = new HtmlTextPreviewer ();
				else
					previewer = new PlainTextPreviewer ();

				try {
					return previewer.GetPreviewText (content, encoding);
				} catch (DecoderFallbackException) {
					return previewer.GetPreviewText (content, CharsetUtils.Latin1);
				}
			}
		}

		/// <summary>
		/// Get a text preview of a string of text.
		/// </summary>
		/// <remarks>
		/// Gets a text preview of a string of text.
		/// </remarks>
		/// <param name="text">The original text.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <paramref name="text"/> is <c>null</c>.
		/// </exception>
		public virtual string GetPreviewText (string text)
		{
			if (text == null)
				throw new ArgumentNullException (nameof (text));

			using (var reader = new StringReader (text))
				return GetPreviewText (reader);
		}

		/// <summary>
		/// Get a text preview of a stream of text in the specified charset.
		/// </summary>
		/// <remarks>
		/// Get a text preview of a stream of text in the specified charset.
		/// </remarks>
		/// <param name="stream">The original text stream.</param>
		/// <param name="charset">The charset encoding of the stream.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <para><paramref name="stream"/> is <c>null</c>.</para>
		/// <para>-or-</para>
		/// <para><paramref name="charset"/> is <c>null</c>.</para>
		/// </exception>
		public virtual string GetPreviewText (Stream stream, string charset)
		{
			if (stream == null)
				throw new ArgumentNullException (nameof (stream));

			if (charset == null)
				throw new ArgumentNullException (nameof (charset));

			Encoding encoding;

			try {
				encoding = CharsetUtils.GetEncoding (charset);
			} catch (NotSupportedException) {
				encoding = CharsetUtils.UTF8;
			}

			return GetPreviewText (stream, encoding);
		}

		/// <summary>
		/// Get a text preview of a stream of text in the specified encoding.
		/// </summary>
		/// <remarks>
		/// Get a text preview of a stream of text in the specified encoding.
		/// </remarks>
		/// <param name="stream">The original text stream.</param>
		/// <param name="encoding">The encoding of the stream.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <para><paramref name="stream"/> is <c>null</c>.</para>
		/// <para>-or-</para>
		/// <para><paramref name="encoding"/> is <c>null</c>.</para>
		/// </exception>
		public virtual string GetPreviewText (Stream stream, Encoding encoding)
		{
			if (stream == null)
				throw new ArgumentNullException (nameof (stream));

			if (encoding == null)
				throw new ArgumentNullException (nameof (encoding));

			using (var reader = new StreamReader (stream, encoding, false, 4096, true))
				return GetPreviewText (reader);
		}

		/// <summary>
		/// Get a text preview of a stream of text.
		/// </summary>
		/// <remarks>
		/// Gets a text preview of a stream of text.
		/// </remarks>
		/// <param name="reader">The original text stream.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <paramref name="reader"/> is <c>null</c>.
		/// </exception>
		public abstract string GetPreviewText (TextReader reader);
	}
}

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;

namespace MimeKit.Text {
	/// <summary>
	/// A text previewer for HTML content.
	/// </summary>
	/// <remarks>
	/// A text previewer for HTML content.
	/// </remarks>
	public class HtmlTextPreviewer : TextPreviewer
	{
		/// <summary>
		/// Initializes a new instance of the <see cref="HtmlTextPreviewer"/> class.
		/// </summary>
		/// <remarks>
		/// Creates a new previewer for HTML.
		/// </remarks>
		public HtmlTextPreviewer ()
		{
		}

		/// <summary>
		/// Get the input format.
		/// </summary>
		/// <remarks>
		/// Gets the input format.
		/// </remarks>
		/// <value>The input format.</value>
		public override TextFormat InputFormat {
			get { return TextFormat.Html; }
		}

		static bool IsWhiteSpace (char c)
		{
			return char.IsWhiteSpace (c) || (c >= 0x200B && c <= 0x200D);
		}

		static bool Append (char[] preview, ref int previewLength, string value, ref bool lwsp)
		{
			int i;

			for (i = 0; i < value.Length && previewLength < preview.Length; i++) {
				if (IsWhiteSpace (value[i])) {
					if (!lwsp) {
						preview[previewLength++] = ' ';
						lwsp = true;
					}
				} else {
					preview[previewLength++] = value[i];
					lwsp = false;
				}
			}

			if (i < value.Length) {
				if (lwsp)
					previewLength--;

				preview[previewLength - 1] = '\u2026';
				lwsp = false;
				return true;
			}

			return false;
		}

		sealed class HtmlTagContext
		{
			public HtmlTagContext (HtmlTagId id)
			{
				TagId = id;
			}

			public HtmlTagId TagId {
				get;
			}

			public int ListIndex {
				get; set;
			}

			public bool SuppressInnerContent {
				get; set;
			}
		}

		static bool SuppressContent (IList<HtmlTagContext> stack)
		{
			int lastIndex = stack.Count - 1;

			return lastIndex >= 0 && stack[lastIndex].SuppressInnerContent;
		}

		HtmlTagContext GetListItemContext (IList<HtmlTagContext> stack)
		{
			for (int i = stack.Count; i > 0; i--) {
				var ctx = stack[i - 1];

				if (ctx.TagId == HtmlTagId.OL || ctx.TagId == HtmlTagId.UL)
					return ctx;
			}

			return null;
		}

		static void Pop (IList<HtmlTagContext> stack, HtmlTagId id)
		{
			for (int i = stack.Count; i > 0; i--) {
				if (stack[i - 1].TagId == id) {
					stack.RemoveAt (i - 1);
					break;
				}
			}
		}

		static bool ShouldSuppressInnerContent (HtmlTagId id)
		{
			switch (id) {
			case HtmlTagId.OL:
			case HtmlTagId.Script:
			case HtmlTagId.Style:
			case HtmlTagId.Table:
			case HtmlTagId.TBody:
			case HtmlTagId.THead:
			case HtmlTagId.TR:
			case HtmlTagId.UL:
				return true;
			default:
				return false;
			}
		}

		/// <summary>
		/// Get a text preview of a stream of text.
		/// </summary>
		/// <remarks>
		/// Gets a text preview of a stream of text.
		/// </remarks>
		/// <param name="reader">The original text stream.</param>
		/// <returns>A string representing a shortened preview of the original text.</returns>
		/// <exception cref="System.ArgumentNullException">
		/// <paramref name="reader"/> is <c>null</c>.
		/// </exception>
		public override string GetPreviewText (TextReader reader)
		{
			if (reader == null)
				throw new ArgumentNullException (nameof (reader));

			var tokenizer = new HtmlTokenizer (reader) { IgnoreTruncatedTags = true };
			var preview = new char[MaximumPreviewLength];
			var stack = new List<HtmlTagContext> ();
			var prefix = string.Empty;
			int previewLength = 0;
			HtmlTagContext ctx;
			HtmlAttribute attr;
			bool body = false;
			bool full = false;
			bool lwsp = true;
			HtmlToken token;

			while (!full && tokenizer.ReadNextToken (out token)) {
				switch (token.Kind) {
				case HtmlTokenKind.Tag:
					var tag = (HtmlTagToken) token;

					if (!tag.IsEndTag) {
						if (body) {
							switch (tag.Id) {
							case HtmlTagId.Image:
								if ((attr = tag.Attributes.FirstOrDefault (x => x.Id == HtmlAttributeId.Alt)) != null) {
									full = Append (preview, ref previewLength, prefix + attr.Value, ref lwsp);
									prefix = string.Empty;
								}
								break;
							case HtmlTagId.LI:
								if ((ctx = GetListItemContext (stack)) != null) {
									if (ctx.TagId == HtmlTagId.OL)
										full = Append (preview, ref previewLength, $" {++ctx.ListIndex}. ", ref lwsp);
									else
										full = Append (preview, ref previewLength, " \u2022 ", ref lwsp);
									prefix = string.Empty;
								}
								break;
							case HtmlTagId.Br:
							case HtmlTagId.P:
								prefix = " ";
								break;
							}

							if (!tag.IsEmptyElement) {
								ctx = new HtmlTagContext (tag.Id) {
									SuppressInnerContent = ShouldSuppressInnerContent (tag.Id)
								};
								stack.Add (ctx);
							}
						} else if (tag.Id == HtmlTagId.Body && !tag.IsEmptyElement) {
							body = true;
						}
					} else if (tag.Id == HtmlTagId.Body) {
						stack.Clear ();
						body = false;
					} else {
						Pop (stack, tag.Id);
					}
					break;
				case HtmlTokenKind.Data:
					if (body && !SuppressContent (stack)) {
						var data = (HtmlDataToken) token;

						full = Append (preview, ref previewLength, prefix + data.Data, ref lwsp);
						prefix = string.Empty;
					}
					break;
				}
			}

			if (lwsp && previewLength > 0)
				previewLength--;

			return new string (preview, 0, previewLength);
		}
	}
}

@jstedfast
Owner

Another update:

It seems that GMail will generate up to about 110 characters worth of "preview snippet" text, but I'm not sure if that's just because that's about the maximum that will fit on my screen and if a wider monitor would get me more text or not.

If I provide this as a class in MimeKit, though, I can't take into consideration the widths of glyphs in the font because who knows what system developers will be using MimeKit/MailKit on to render their messages, so basing it on rendering bounds is just not practical.

... But 110 characters seems reasonable, so I think I'll do that.

@SmbatChilingaryan
Author

I checked the snippet length via the public Gmail API, and the maximum snippet length is 230 characters.
Also, once the snippet contains only text, without redundant HTML tags, links, and symbols, we can append "..." to the end of truncated text.
So 110, or maybe 230, would be a reasonable limit for this field.

Thank you.

jstedfast added a commit to jstedfast/MimeKit that referenced this issue Mar 26, 2020
@jstedfast
Owner

Ah, that was a good idea to check the Gmail API...

I've made the previewer configurable (up to 1024 characters long, at least), but maybe I'll bump it up to 230 (currently defaults to 110).

The above commit adds the necessary classes to MimeKit to generate the snippets from a TextPart, string, Stream or TextReader.

Next step is to update ImapFolder.cs to use those classes.
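
Once a MimeKit release ships with these classes, calling them directly might look like the sketch below. `TextPreviewer`, `HtmlTextPreviewer`, `MaximumPreviewLength` and `GetPreviewText` are taken from the code posted earlier in this thread; `PreviewDemo` and its method names are hypothetical, and the released API may differ from the draft.

```csharp
using System;
using MimeKit;
using MimeKit.Text;

static class PreviewDemo
{
    // Uses the static helper, which picks the HTML or plain-text previewer based on the part.
    public static string FromPart (string html)
    {
        var part = new TextPart ("html") { Text = html };

        return TextPreviewer.GetPreviewText (part);
    }

    // Uses a specific previewer with a custom length limit (1 to 1024 in the posted code).
    public static string FromString (string html, int maxLength)
    {
        var previewer = new HtmlTextPreviewer { MaximumPreviewLength = maxLength };

        return previewer.GetPreviewText (html);
    }
}
```

The static overload is the convenient path when a `TextPart` is already at hand; the instance overload is the one to use when the caller wants to control the maximum length.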

@jstedfast jstedfast reopened this Mar 26, 2020
@jstedfast
Owner

I remembered tonight that there was, at one point, talk of an IMAP extension for this so I looked it up and found it. It was an extension called SNIPPET for a while, but it looks like the latest revision of the draft spec is now calling them PREVIEWs:

https://tools.ietf.org/html/draft-ietf-extra-imap-fetch-preview-07

@jstedfast
Owner

The draft suggests limiting PREVIEW text to 200 characters, with a hard maximum of 256.

Based on that, I'll probably just go with the 230 recommendation that you suggested based on GMail's API.
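
To make the length cap concrete, here is a minimal sketch of collapsing whitespace and truncating plain text to a fixed number of characters, ending in an ellipsis when something was cut off. `PlainPreview.Make` is a hypothetical helper written for illustration. It is not the `PlainTextPreviewer` referenced in the code above, which is not shown in this thread, and the 230 default simply mirrors the number discussed here.

```csharp
using System;
using System.Text;

static class PlainPreview
{
    // Collapses whitespace runs to single spaces and caps the result at maxLength characters.
    public static string Make (string text, int maxLength = 230)
    {
        if (text == null)
            throw new ArgumentNullException (nameof (text));

        if (maxLength < 1)
            throw new ArgumentOutOfRangeException (nameof (maxLength));

        var preview = new StringBuilder ();
        bool pendingSpace = false;

        foreach (char c in text) {
            if (char.IsWhiteSpace (c)) {
                // Remember the gap, but never emit leading whitespace.
                pendingSpace = preview.Length > 0;
                continue;
            }

            int needed = pendingSpace ? 2 : 1;

            if (preview.Length + needed > maxLength) {
                // Out of room: make space for the ellipsis if the buffer is already full.
                if (preview.Length == maxLength)
                    preview.Length--;

                preview.Append ('\u2026');
                return preview.ToString ();
            }

            if (pendingSpace)
                preview.Append (' ');

            preview.Append (c);
            pendingSpace = false;
        }

        return preview.ToString ();
    }
}
```

Counting characters rather than rendered width keeps the helper independent of fonts and screen size, which matches the reasoning given earlier in the thread.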

@jstedfast
Owner

OK, bumped the default to 230 as well, now.

@jstedfast jstedfast added enhancement New feature or request and removed question A question about how to do something labels Mar 28, 2020
@SmbatChilingaryan
Author

That's cool. But how can I use the new preview text?
My steps are currently as follows:

List<Message> messages = new List<Message>();
using (var imap = new ImapClient())
{
    imap.Connect("imap.gmail.com", 993, SecureSocketOptions.SslOnConnect);
    imap.Authenticate(username, Password);
    var gmailFolder = imap.GetFolder(folderName);
    gmailFolder.Open(FolderAccess.ReadWrite);
    var msg = gmailFolder.Fetch(0, -1, MessageSummaryItems.All |
        MessageSummaryItems.References |
        MessageSummaryItems.UniqueId |
        MessageSummaryItems.PreviewText |
        MessageSummaryItems.GMailMessageId |
        MessageSummaryItems.GMailThreadId);

    foreach (var item in msg)
    {
        var message = new Message
        {
            snippet = item.PreviewText
        };
        messages.Add(message);
    }

    imap.Disconnect(true);
}

But PreviewText from MessageSummary still contains HTML tags.
Can you describe the steps to get PreviewText without HTML tags?

Thank you.

@jstedfast
Owner

I haven't published any packages with the PreviewText fixes yet, so you'd have to build from source.

Even the MyGet packages won't have this feature yet because the build requires a newer version of MimeKit to be released with the PreviewText feature.

@SmbatChilingaryan
Author

So the PreviewText feature will be available in the next MimeKit version, maybe 2.5.3?

@jstedfast
Owner

It'll probably be 2.6.0

@jstedfast
Owner

I'll also be releasing MailKit 2.6.0 at about the same time and when I do, the code you are currently using should work fine.

@SmbatChilingaryan
Author

OK. Thanks a lot.
