Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

Closed
davidmcnabnz opened this issue Mar 28, 2024 · 0 comments
Closed
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes topic-email topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@davidmcnabnz
Copy link

davidmcnabnz commented Mar 28, 2024

lines = value.splitlines()

I think it's problematic that the method email.policy.EmailPolicy._fold() relies on the generic str / bytes method .splitlines(), especially in an email-processing context where the "official" line ending is \r\n.

I'm one of many devs who also leniently recognise (regex) [\r\n]+ as a line break in emails. But I have no idea why all the other ending characters from other contexts are also used in a specific mail-manipulation context.

On the surface, .splitlines() seems a simple way to cover the case of a header value itself containing line endings.

However, in cases where a header value may contain multi-byte Unicode sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as legacy ASCII 'form-feed', and deemed to be a line ending. This then breaks the sequence, which in turn, causes problems in the subsequent processing of the email message.

A specimen header (from real-world production traffic) which triggers this behaviour is:

b'Subject: P/L SEND : CARA-23PH00021,,   0xf2\x0C\xd8/FTEP'

Here, the \x0C is treated as a line-ending, so the trailing portion b'\xd8/FTEP' gets wrapped and indented on the next line.

To work around this in my networks, I've had to subclass email.policy.EmailPolicy, and override the method ._fold() to instead split only on CR/LFs, via

RE_EOL_STR = re.compile(r'[\r\n]+')
RE_EOL_BYTES = re.compile(rb'[\r\n]+')

...

class MyPolicy(email.policy.EmailPolicy):

    ...

    def _fold(self, name, value, refold_binary=False):
        """
        Need to override this from email.policy.EmailPolicy to stop it treating chars other than
        CR and LF as newlines
        :param name:
        :param value:
        :param refold_binary:
        :return:
        """
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize

        # this is from the library version, and it improperly breaks on chars like 0x0c, treating
        # them as 'form feed' etc.
        # we need to ensure that only CR/LF is used as end of line
        #lines = value.splitlines()

        # this is a workaround which splits only on CR/LF characters
        if refold_binary:
            lines = RE_EOL_BYTES.split(value)
        else:
            lines = RE_EOL_STR.split(value)

        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                    (lines and len(lines[0])+len(name)+2 > maxlen or
                     any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep

Can the maintainers of this class please advise with their thoughts?

Given that RFC822 and related standards specify that the "official" line ending is \r\n, is there any reason to catch everything else that may also be considered in other string contexts to constitute a line ending?

Linked PRs

@serhiy-storchaka serhiy-storchaka added topic-unicode topic-email type-bug An unexpected behavior, bug, or error 3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes labels Mar 28, 2024
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Mar 29, 2024
…d line separators

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
serhiy-storchaka added a commit that referenced this issue Apr 17, 2024
… separators (GH-117369)

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 17, 2024
…d line separators (pythonGH-117369)

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
(cherry picked from commit aec1dac)

Co-authored-by: Serhiy Storchaka <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 17, 2024
…d line separators (pythonGH-117369)

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
(cherry picked from commit aec1dac)

Co-authored-by: Serhiy Storchaka <[email protected]>
@erlend-aasland erlend-aasland removed the 3.11 only security fixes label Apr 17, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…d line separators (pythonGH-117369)

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
serhiy-storchaka added a commit that referenced this issue Apr 17, 2024
…rd line separators (GH-117369) (GH-117971)

Only treat '\n', '\r' and '\r\n' as line separators in re-folding the email
messages.  Preserve control characters '\v', '\f', '\x1c', '\x1d' and '\x1e'
and Unicode line separators '\x85', '\u2028' and '\u2029' as is.
(cherry picked from commit aec1dac)

Co-authored-by: Serhiy Storchaka <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes topic-email topic-unicode type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants