Plan for upcoming 3x memory improvement PR (and a few others) #296
Comments
Progress report: In revision 5 of the prototype, I extracted the "join a prefix to a file object" logic into a separate class (with a few tests); this is useful for a few reasons. I'm still thinking of how to split the change into a few refactoring commits and a few feature commits, so it's easier to review.
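For illustration, a minimal sketch of what such a class might look like (the name and details are mine, not necessarily the prototype's): it replays an already-read prefix before delegating to the underlying file, so the bytes used for encoding detection don't require a seekable stream or a second read.

```python
import io


class PrefixFileWrapper:
    """Read-only file-like object that yields `prefix` first,
    then the remaining content of `file` (illustrative sketch only)."""

    def __init__(self, prefix, file):
        self.prefix = prefix
        self.file = file
        self.offset = 0  # how much of the prefix was already returned

    def read(self, size=-1):
        chunk = self.prefix[self.offset:]
        if size >= 0:
            chunk = chunk[:size]
        self.offset += len(chunk)
        if size < 0:
            return chunk + self.file.read()
        if len(chunk) < size:
            chunk += self.file.read(size - len(chunk))
        return chunk


file = io.BytesIO(b"<feed>...</feed>")
prefix = file.read(6)                     # bytes used for encoding detection
wrapped = PrefixFileWrapper(prefix, file)
assert wrapped.read() == b"<feed>...</feed>"
```

With something like this, parse() can read just the first part of the feed to detect the encoding, then hand the parser a stream that behaves as if nothing had been read yet.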
@lemon24 great initiative! These leaks are hurting us too. How far along are you? Is there something I can help with, or do you have a workaround for these issues?
@tadeoos, parse() in the gist from my previous comment works correctly for any seekable file (that is, to parse a URL you have to download the feed yourself and pass the open file to it). The next step is to integrate the stuff from the gist into feedparser itself (instead of wrapping some internals), without breaking anything, and without making the code significantly harder to maintain going forward. (Sadly, I haven't had the time/disposition to continue this. It's likely I'll pick it up again in the following month, but I'm not promising anything.)
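For anyone who wants to try the gist in the meantime, usage would look roughly like this (the gist's parse() signature is an assumption on my part; downloading to a temporary file gives a seekable stream without holding the whole feed in memory):

```python
import shutil
import tempfile
import urllib.request

# Hypothetical usage of the gist's parse(), not feedparser.parse().
with tempfile.TemporaryFile() as file:
    with urllib.request.urlopen("https://example.com/feed.xml") as response:
        shutil.copyfileobj(response, file)  # stream to disk, not into memory
    file.seek(0)
    result = parse(file)
```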
Thanks for your work on PR #302!
@kurtmckee, I want to continue working on the other two issues when I have some time. Do you agree with the solutions proposed in my first comment? The first one ("defusedxml or lxml can't be used with feedparser") requires a (backwards compatible) API change. I'm looking for your input because this implies that feedparser will continue to use a SAX parser indefinitely (but IMHO, the existence of PREFERRED_XML_PARSERS already implies that).
My apologies, I overlooked the additional items! I'd like to discuss these items in more detail, but the discussion re: SAX parsers and fallbacks has been tearing at me for a long time. I want feedparser to get to a point where it's not trying to use an XML parser and then falling back to a loose parser. To that end, I'm currently writing a handler that uses […].

I think at that point the SAX parsing could be completely ripped out, and feedparser could drop its code that supports re-parsing the same document twice in two different modes. I anticipate that this lays the groundwork for feedparser to begin parsing h-feeds.

Do you think that migrating from XML/sgml parsers to HTML/lxml parsers could address the current lack of support for defusedxml?
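To make the single-parser idea concrete (lxml here is purely an illustration; whether feedparser would actually adopt it is the open question above), a recovering parser accepts both well-formed and broken markup in one pass, which is what would let the strict-then-loose two-pass approach go away:

```python
from lxml import etree

# One recovering parser handles both documents below in a single pass,
# instead of parsing strictly and re-parsing loosely on failure.
parser = etree.XMLParser(recover=True)

well_formed = b"<feed><title>ok</title></feed>"
broken = b"<feed><title>missing closing tag</feed>"

for data in (well_formed, broken):
    root = etree.fromstring(data, parser=parser)
    print(root.findtext("title"))
```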
Overall, I am +1 on not using an XML parser (if it's not an overwhelming amount of work to do). From what I understand, most issues with XML come from it being able to do too much (file/network access etc.), and libraries assuming it comes from a trusted source; a feed is pretty much in the opposite situation. This will likely simplify things for decoding as well:
Regarding the need for defusedxml:
So, given your plans, I think it's more than reasonable to not move forward with defusedxml / custom SAX parser support (feel free to close this).
While writing the comment above, I realized there may be a number of opportunities presented by a reworking of the parsing backend. I understand you're operating under a number of constraints: time, backwards compatibility, and, I assume, pure-Python-only dependencies – at least out of the box. Even if none of the following are included, I still think it's useful to list them out, in case they influence your design.
From what I understand, feedparser is really old, and it needed to do a lot of work for things that aren't strictly related to feeds because there was no better option available at the time (and the packaging story wasn't ideal either). Since then, the library ecosystem has matured quite a bit, and (carefully) farming out some of that work to others might lower the maintenance burden significantly.

I am not suggesting you do any of the above, just bringing them up so they may have an influence on whatever happens next. If you find one of them particularly interesting, I can look into it further when I have some time. Once again, thank you for all your work on feedparser!
Hi @kurtmckee,
I'm using feedparser heavily in reader, a feed reader library.
There are 3 issues I'd like to submit PRs for; I've opened this issue to provide a bit of background.
If time allows, please let me know if you see any issues with the proposed approaches. (I will be working on the PRs regardless.)
Thank you, appreciate your work!
Proposed changes follow.
3x memory usage reduction
I managed to reduce memory usage of parse() by ~3x (prototype) and make it 4% faster, by not reading the entire feed into memory, and decoding it on the fly (a rough sketch of the idea follows at the end of this section). I tested it with 150 feeds I use day to day, and the results were identical with those of feedparser 6.0.8.
tl;dr of the change: in parse():
For backwards compatibility, the new behavior would be controlled by a new argument, parse(..., optimistic_encoding_detection=True), with False keeping the current read-everything-into-memory behavior; not having .optimistic_encoding_detection at all would make the code simpler, though. To increase the chances of correctly detecting the encoding, we can make the feed fragment used reasonably large (e.g. 100K). Do you think this is an acceptable tradeoff?

Proposed refactoring (not required, but it would make parse() easier to understand/change going forward): add a new function, parse_file(file, ...), that only takes a stream; parse() would call parse_file() after getting one from _open_resource().

Resolves: #209, maybe #287.
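To make the overall shape concrete, here is a self-contained sketch (not the prototype, and not feedparser's API; TitleHandler and parse_stream are names I made up): read a bounded prefix, which is where encoding detection would happen, then feed the rest of the stream to an incremental SAX parser chunk by chunk, so the whole feed is never held in memory at once. The prototype also decodes on the fly; here expat is left to deal with the raw bytes to keep things short.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._current = None

    def startElement(self, name, attrs):
        if name == "title":
            self._current = []

    def characters(self, content):
        if self._current is not None:
            self._current.append(content)

    def endElement(self, name):
        if name == "title" and self._current is not None:
            self.titles.append("".join(self._current))
            self._current = None


def parse_stream(file, prefix_size=100 * 1024, chunk_size=2 ** 16):
    # The prefix is where encoding detection (BOM, XML declaration,
    # HTTP headers) would happen in the real thing.
    prefix = file.read(prefix_size)

    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Feed the parser incrementally instead of reading everything first.
    parser.feed(prefix)
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()

    return handler.titles


feed = b'<?xml version="1.0"?><feed><title>example</title></feed>'
print(parse_stream(io.BytesIO(feed)))  # ['example']
```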
Make Atom summary/content not depend on order
#59 already has a pending pull request for this.
I'm not sure if it was not merged because it was never a priority, or because more testing is required; if the latter, I can do the needed work, so please let me know.
A sharper solution is to side-step the summary/content detection logic for Atom only. My understanding is that the reliance on order was mainly because RSS doesn't really have separate summary and content elements. Atom does, so when we get a summary tag, we can just use it for summary. Prototype (tested with 150 feeds, of which ~55 were Atom).
Resolves: #59.
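For illustration, a toy standalone example of the order-independent mapping (plain ElementTree, not feedparser's actual handler code): for Atom, summary always maps to summary and content to content, regardless of which element comes first.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entry_summary_and_content(entry):
    """Map Atom elements to fields by tag name, not by document order."""
    summary = entry.findtext(f"{ATOM}summary")
    content = entry.findtext(f"{ATOM}content")
    return summary, content


# <summary> comes after <content>: the order doesn't matter.
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><content>full text</content><summary>short text</summary></entry>"
    "</feed>"
)
print(entry_summary_and_content(feed.find(f"{ATOM}entry")))
# ('short text', 'full text')
```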
defusedxml or lxml can't be used with feedparser
This can kinda be done by appending to feedparser.api.PREFERRED_XML_PARSERS, but it has a number of issues:
For reference, here's what using PREFERRED_XML_PARSERS with feedparser looks like now: defusedxml, lxml.
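The defusedxml variant of that workaround looks roughly like this (assuming feedparser still passes feedparser.api.PREFERRED_XML_PARSERS to xml.sax.make_parser(), which imports each listed module and calls its create_parser(); defusedxml.expatreader provides one). This is the approach with the issues mentioned above:

```python
import feedparser
import feedparser.api

# Ask xml.sax.make_parser() to try defusedxml's expat reader before the
# standard library one; this relies on feedparser internals staying as-is.
feedparser.api.PREFERRED_XML_PARSERS.append("defusedxml.expatreader")

result = feedparser.parse("https://example.com/feed.xml")
```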
My solution to this would be to add a new argument, parse(..., sax_parser=None); for backwards compatibility, it'd default to the current behavior.

Resolves: #107 (XML parsing is not safe by default, but it is possible to make it safe).
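Usage of the proposed argument might look something like this (hypothetical, since the argument doesn't exist yet, and whether it should take a parser instance, a factory, or a module name is still open):

```python
import defusedxml.sax
import feedparser

# Hypothetical: tell parse() which SAX parser factory to use;
# sax_parser=None would keep the current behavior.
result = feedparser.parse(
    "https://example.com/feed.xml",
    sax_parser=defusedxml.sax.make_parser,
)
```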