Consider supporting alternative feed parsers #264

lemon24 · 2021-11-18T18:40:21Z

In light of the various issues feedparser has (see #265), I think it's wise to consider using other feed parser implementations.

In this issue, we'll:

- Summarize possible alternatives.
- Document the changes needed to use an alternative implementation.

lemon24 · 2021-11-18T19:31:59Z

The logical pipeline of parsing a feed:

- detect encoding (from headers or stream)
- detect xml:base (from headers; needed for relative link resolution – Broken relative links #125)
- detect high-level format (XML, JSON)
- parse XML/JSON stream into intermediary generic Python data structure (ElementTree, JSON dict)
  - can be secure or not (Consider using defusedxml #212)
  - can support malformed high-level markup (both feedparser and lxml can do this; broken XML does exist in the wild)
  - store xml:base
- detect feed format (RSS, Atom, JSON Feed)
- convert generic Python data structure into feed data structure
- resolve relative links
  - optional
- sanitize content (Broken relative links #125 (comment), JSON feed content is not sanitized #227)
  - optional
- unify feed data structure (so it looks the same regardless of feed format)
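
To make the two detection steps concrete, here is a minimal sketch using only the standard library. It is not how feedparser or reader do it: the function name `detect_feed_format` and its return values are made up for illustration, it only recognizes RSS 2.0-style `<rss>` roots, Atom 1.0 and JSON Feed, and it uses plain `xml.etree.ElementTree`, which is neither hardened against malicious XML nor tolerant of malformed markup.

```python
import json
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_format(data: bytes) -> str:
    """Return 'json feed', 'atom', 'rss', or 'unknown' (hypothetical helper)."""
    # High-level format: assume JSON documents start with '{', anything else is XML.
    if data.lstrip().startswith(b"{"):
        try:
            doc = json.loads(data)
        except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
            return "unknown"
        # JSON Feed documents carry a version URL that mentions jsonfeed.org.
        version = doc.get("version", "")
        if isinstance(version, str) and "jsonfeed.org" in version:
            return "json feed"
        return "unknown"

    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Broken XML is simply rejected here; a real parser may want to recover.
        return "unknown"

    # Feed format: decided from the root element of the parsed tree.
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    if root.tag == "rss":
        return "rss"
    return "unknown"
```

Note that a feed with unclosed tags ends up as `'unknown'` here, which is exactly the "malformed high-level markup" case the pipeline has to decide how to handle.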

Currently:

- feedparser goes directly from stream to feed data structure by using xml.sax.
  - lxml has a way of converting an etree into sax events (Consider using defusedxml #212 (comment)).
- For JSON Feed, reader only relies on the inferred MIME type; feedparser does some sniffing to detect the feed format, and we rely on that (that is, reader has no logic to tell RSS apart from Atom etc.).
- Both relative link resolution (requires xml:base) and content sanitization can happen before or after storage; feedparser does them before storage, and I'm not sure if we can use it to do it after (the things it uses probably aren't stable, and they tie into the sax parsing logic).
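
As a rough illustration of doing relative link resolution after storage, the sketch below assumes the XML base was stored next to the entry. The `resolve_links` helper and the dict shape with a `link` key are hypothetical, not reader's or feedparser's data model, and links embedded in HTML content are not handled.

```python
from urllib.parse import urljoin


def resolve_links(entry: dict, base: str) -> dict:
    """Return a copy of entry with its 'link' made absolute against base."""
    resolved = dict(entry)
    link = resolved.get("link")
    if link:
        # urljoin() leaves links that are already absolute unchanged.
        resolved["link"] = urljoin(base, link)
    return resolved


stored_base = "http://example.com/blog/feed.xml"  # the saved xml:base
entry = resolve_links({"title": "One", "link": "posts/1"}, stored_base)
```

This only works if the parsing step exposes and stores the base, which is why "store xml:base" appears in the pipeline above.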

lemon24 · 2021-11-29T14:19:34Z

I've pretty much decided to continue using feedparser (#265 (comment)) and not switch to Atoma (#263), but it's worth documenting the factors that went into the decision.

I looked at feedparser 6.0.8 and Atoma 0.0.17.

| | feedparser | Atoma |
| --- | --- | --- |
| stable | yes | no (0.x) |
| maintainer responsiveness | low | high |
| format detection | yes | yes (tries to parse all formats) |
| JSON feed | no | yes |
| old feed formats | yes | no |
| Atom/RSS extensions | medium | high |
| file objects | yes | yes (no autodetection) |
| memory usage | high (reads feed in memory multiple times) | medium (builds whole etree) |
| typed | no | yes |
| safe XML | no | yes (defusedxml) |
| pluggable XML parser (defusedxml, lxml) | no (yes with global/monkeypatching) | no |
| bad encodings | yes | no |
| malformed feeds | yes | no |
| relative link resolution | yes (can be disabled, exposes XML base) | no |
| HTML sanitization | yes (can be disabled) | no |
| unified feed/entry interface | yes | no |

lemon24 · 2022-01-29T08:53:25Z

Closing in favor of #265.

lemon24 added a commit that referenced this issue Jan 29, 2022

Dev notes for #263 and #264.

3c48990

lemon24 added core feed parsing labels Jan 29, 2022

lemon24 closed this as completed Jan 29, 2022
