Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding the subject with iconv_mime_decode() #410

Open
freescout-helpdesk opened this issue May 31, 2023 · 12 comments
Open

Decoding the subject with iconv_mime_decode() #410

freescout-helpdesk opened this issue May 31, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@freescout-helpdesk
Copy link
Contributor

freescout-helpdesk commented May 31, 2023

Describe the bug

Currently Webklex/php-imap by default uses 'utf-8' decoder and \imap_utf8() function to decode subject:

if ($decoder === 'utf-8' && extension_loaded('imap')) {

This approach does not always decode subjects properly.

For example Subject: =?ISO-2022-JP?B?GyRCIXlCaBsoQjEzMhskQjlmISEhViUsITwlRyVzGyhCJhskQiUoJS8lOSVGJWolIiFXQGxMZ0U5JE4kPyRhJE4jURsoQiYbJEIjQSU1JW0lcyEhIVo3bjQpJSglLyU5JUYlaiUiISYlbyE8JS8hWxsoQg==?= will be decoded into garbled string:

�$B!yBh�(B132�$B9f!!!V%,!<%G%s�(B&�$B%(%/%9%F%j%"!W@lLgE9$N$?$a$N#Q�(B&�$B#A%5%m%s!!!Z7n4)%(%/%9%F%j%"!&%o!<%/![�(B

So it does not make much sense to use by default the approach which is not 100% reliable.

Some time ago we've switched to decoding the subject simply with iconv_mime_decode():

$value = iconv_mime_decode($value, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, "UTF-8");

It works always and properly - freescout-help-desk/freescout#2965 (comment).

So maybe it worth making iconv_mime_decode() the default and the only approach for decoding the subject like this: https://github.com/freescout-helpdesk/freescout/blob/master/overrides/webklex/php-imap/src/Header.php#L208

Function:
https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L909 (decodeSubject())

Related issue: Webklex/laravel-imap#295

@Webklex Webklex added bug Something isn't working validating labels Jun 23, 2023
Webklex added a commit that referenced this issue Jun 23, 2023
@Webklex Webklex added enhancement New feature or request and removed bug Something isn't working validating labels Jun 23, 2023
@Webklex
Copy link
Owner

Webklex commented Jun 23, 2023

Hi @freescout-helpdesk ,
many thanks for your report and suggestion. I improved the decoder and added a new test case for the mentioned subject above.

Best regards & happy coding,

@nuernbergerA
Copy link
Contributor

@Webklex Attachment::decodeName() has the same problem

@Webklex
Copy link
Owner

Webklex commented Jun 27, 2023

Hi @nuernbergerA ,
thanks for your report. I just pushed a new test including your mentioned scenario but can't reproduce the described issue.
Are you using the latest release? If not, please update and try again. There's a chance that the problem might have already been fixed in a newer release.

If the issue persists, please head over to https://github.com/Webklex/php-imap/issues/new?assignees=&labels=&projects=&template=bug_report.md and provide as much information as possible. Every little bit, even if it might seem unrelated could help to solve the mystery :)

Best regards & happy coding,

@nuernbergerA
Copy link
Contributor

@Webklex i've opened a PR with a failing test for my issue ;)

@freescout-helpdesk
Copy link
Contributor Author

@Webklex i've opened a PR with a failing test for my issue ;)

We've checked. This our function https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L883C8-L883C8 also successfully works for decoding headers and attachment names including your example:

=?iso-8859-1?Q?2021=5FM=E4ngelliste=5F0819306.xlsx?= >> 2021_Mängelliste_0819306.xlsx

@Webklex Webklex reopened this Jun 28, 2023
@Webklex
Copy link
Owner

Webklex commented Jun 28, 2023

@nuernbergerA thanks for your input and your pull request. Please update to v5.5 and give it another try :)

@freescout-helpdesk how are you decoding attachment names / filenames? If you are relying on Attachment::decodeName(), please note that this if statement was reversed:
https://github.com/Webklex/php-imap/blob/5.5.0/src/Attachment.php#L296-L300
The fallback was used as default, which works often but can cause unexpected behavior (#421) in some cases :/

Best regards & happy coding,

@freescout-helpdesk
Copy link
Contributor Author

@freescout-helpdesk how are you decoding attachment names / filenames?

We are using the same our MailHelper::decodeSubject() function to decode attachment names - it works fine.

@jackwiy
Copy link

jackwiy commented Jul 4, 2024

@Webklex

$personal = '=?gb2312?B?z+O42yCDfN14u80=?=';
imap_mime_header_decode()
The result obtained by this method:��� �|�x��

public function testUTF8Content()
    {
        $personal = '=?gb2312?B?z+O42yCDfN14u80=?=';
        $personal1 = imap_mime_header_decode($personal);

        echo '<pre>';
        print_r(['personal1'=>$personal1]);

        $header = new Header('');
        $personalParts = $header->mime_header_decode($personal);
        $personal2 = '';
        foreach ($personalParts as $p) {
            print_r(['pText'=>$p->text]);
            $personal2 .= $header->convertEncoding($p->text, $header->getEncoding($p));
        }

        print_r(['personal2'=>$personal2]);
    }
图片 图片 图片

@jackwiy
Copy link

jackwiy commented Jul 5, 2024

public function testUTF8Content()
    {
        $personal = '=?gb2312?B?z+O42yCDfN14u80=?=';

        $personal2 = imap_utf8($personal);
        print_r(['personal2'=>$personal2]);

        $header = new Header('');
        $personalParts = $header->mime_header_decode($personal);
        $personal3 = '';
        foreach ($personalParts as $p) {
            print_r(['pText'=>$p->text]);
            //$personal3 .= $header->convertEncoding($p->text, $header->getEncoding($p));
            $personal3 .= iconv($header->getEncoding($p), 'UTF-8//IGNORE', $p->text);
        }

        print_r(['personal3'=>$personal3]);
    }
图片 图片 图片

@thin-k-design
Copy link
Contributor

thin-k-design commented Aug 16, 2024

I have a related problem. I received an email with an attachment, whose filename is "iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf"

The filename gets decoded as Elk�ld�tt felsz�l�t�s ELK-F-24011370.pdf

After url decoding the string it should be converted to UTF-8:

            if (str_contains($name, "''")) {
                $parts = explode("''", $name);
                if (EncodingAliases::has($parts[0])) {
                    $name = implode("''", array_slice($parts, 1));
                }
            }
            [...]
            // check if $name is url encoded
            if (preg_match('/%[0-9A-F]{2}/i', $name)) {
                $name = urldecode($name);
            }

It should be something like:

           if (str_contains($name, "''")) {
               $parts = explode("''", $name);
               if (EncodingAliases::has($parts[0])) {
                   $charset = $parts[0];
                   $name = implode("''", array_slice($parts, 1));
               }
           }
           [...]
           // check if $name is url encoded
           if (preg_match('/%[0-9A-F]{2}/i', $name)) {
               $name = urldecode($name);
               if (isset($charset)) {
                   $name = mb_convert_encoding($name, 'UTF-8', $charset);
               }
           }

I am not sure of the implications, so I haven't done a pull request yet.

@thin-k-design
Copy link
Contributor

Actually it seems that the first part of the code is not equivalent to the RFC 2231 Specification, where a string can be represented like charset'language'encoded-text, in my case iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf which translates in:

  • charset: iso-8859-1
  • language: empty
  • encoded-text: Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf

So I think, this:

           if (str_contains($name, "''")) {
               $parts = explode("''", $name);
               if (EncodingAliases::has($parts[0])) {
                   $name = implode("''", array_slice($parts, 1));
               }
           }

may be replaced with:

            $parts = explode("'", $name, 3);
            if (EncodingAliases::has($parts[0])) {
                list($charset, $language, $name) = $parts;
            }

@thin-k-design
Copy link
Contributor

My pull request did not pass the following test:
AttachmentLongFilenameTest.php
self::assertEquals("f7b5181985862431bfc443d26e3af2371e20a0afd676eeb9b9595a26d42e0b73", hash("sha256", $attachment->filename));

The assertion in the test assumes that the output filename of iso-8859-15''%30%31%5F%41%A4%E0%E4%3F%3F%3F%3F%40%5A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2E%74%78%74 is in its original encoding and not in utf-8.

Maybe I am missing something, but should this be the intended behaviour (leaving the original encoding as is), how can I get the encoding of the filename in the Attachment class, so I can convert it myself if needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants