email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

davidmcnabnz · 2024-03-28T05:09:41Z

Line 208 in eefff68

lines = value.splitlines()

I think it's problematic that the method email.policy.EmailPolicy._fold() relies on the generic str / bytes method .splitlines(), especially in an email-processing context where the "official" line ending is \r\n.

I'm one of many devs who also leniently recognise (regex) [\r\n]+ as a line break in emails. But I have no idea why all the other ending characters from other contexts are also used in a specific mail-manipulation context.

On the surface, .splitlines() seems a simple way to cover the case of a header value itself containing line endings.

However, in cases where a header value may contain multi-byte Unicode sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as legacy ASCII 'form-feed', and deemed to be a line ending. This then breaks the sequence, which in turn, causes problems in the subsequent processing of the email message.

A specimen header (from real-world production traffic) which triggers this behaviour is:

b'Subject: P/L SEND : CARA-23PH00021,,   0xf2\x0C\xd8/FTEP'

Here, the \x0C is treated as a line-ending, so the trailing portion b'\xd8/FTEP' gets wrapped and indented on the next line.

To work around this in my networks, I've had to subclass email.policy.EmailPolicy, and override the method ._fold() to instead split only on CR/LFs, via

RE_EOL_STR = re.compile(r'[\r\n]+')
RE_EOL_BYTES = re.compile(rb'[\r\n]+')

...

class MyPolicy(email.policy.EmailPolicy):

    ...

    def _fold(self, name, value, refold_binary=False):
        """
        Need to override this from email.policy.EmailPolicy to stop it treating chars other than
        CR and LF as newlines
        :param name:
        :param value:
        :param refold_binary:
        :return:
        """
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize

        # this is from the library version, and it improperly breaks on chars like 0x0c, treating
        # them as 'form feed' etc.
        # we need to ensure that only CR/LF is used as end of line
        #lines = value.splitlines()

        # this is a workaround which splits only on CR/LF characters
        if refold_binary:
            lines = RE_EOL_BYTES.split(value)
        else:
            lines = RE_EOL_STR.split(value)

        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                    (lines and len(lines[0])+len(name)+2 > maxlen or
                     any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep

Can the maintainers of this class please advise with their thoughts?

Given that RFC822 and related standards specify that the "official" line ending is \r\n, is there any reason to catch everything else that may also be considered in other string contexts to constitute a line ending?

Linked PRs

The text was updated successfully, but these errors were encountered:

…d line separators Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email messages. Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e' and Unicode line separators '\x85', '\u2028' and '\u2029' as is.

… separators (GH-117369) Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email messages. Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e' and Unicode line separators '\x85', '\u2028' and '\u2029' as is.

…d line separators (pythonGH-117369) Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email messages. Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e' and Unicode line separators '\x85', '\u2028' and '\u2029' as is. (cherry picked from commit aec1dac) Co-authored-by: Serhiy Storchaka <[email protected]>

…d line separators (pythonGH-117369) Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email messages. Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e' and Unicode line separators '\x85', '\u2028' and '\u2029' as is.

…rd line separators (GH-117369) (GH-117971) Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email messages. Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e' and Unicode line separators '\x85', '\u2028' and '\u2029' as is. (cherry picked from commit aec1dac) Co-authored-by: Serhiy Storchaka <[email protected]>

serhiy-storchaka added topic-unicode topic-email type-bug An unexpected behavior, bug, or error 3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes labels Mar 28, 2024

bedevere-app bot mentioned this issue Mar 29, 2024

gh-117313: Fix re-folding email messages containing non-standard line separators #117369

Merged

This was referenced Apr 17, 2024

[3.12] gh-117313: Fix re-folding email messages containing non-standard line separators (GH-117369) #117971

Merged

[3.11] gh-117313: Fix re-folding email messages containing non-standard line separators (GH-117369) #117972

Closed

serhiy-storchaka closed this as completed Apr 17, 2024

erlend-aasland removed the 3.11 only security fixes label Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

davidmcnabnz commented Mar 28, 2024 •

edited by bedevere-app bot

Loading

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

Comments

davidmcnabnz commented Mar 28, 2024 • edited by bedevere-app bot Loading

Linked PRs

davidmcnabnz commented Mar 28, 2024 •

edited by bedevere-app bot

Loading