Consider supporting alternative feed parsers #264

Closed
lemon24 opened this issue Nov 18, 2021 · 3 comments

Comments

@lemon24
Owner

lemon24 commented Nov 18, 2021

In light of the various issues feedparser has (see #265), I think it's wise to consider alternative feed parser implementations.

In this issue, we'll:

  • Summarize possible alternatives.
  • Document the changes needed to use an alternative implementation.
@lemon24
Owner Author

lemon24 commented Nov 18, 2021

The logical pipeline of parsing a feed:

  • detect encoding (from headers or stream)
  • detect xml:base (from headers; needed for relative link resolution – Broken relative links #125)
  • detect high-level format (XML, JSON)
  • parse XML/JSON stream into intermediary generic Python data structure (ElementTree, JSON dict)
    • can be secure or not (Consider using defusedxml #212)
    • can support malformed high-level markup (both feedparser and lxml can do this; broken XML does exist in the wild)
    • store xml:base
  • detect feed format (RSS, Atom, JSON Feed)
  • convert generic Python data structure into feed data structure
  • resolve relative links
    • optional
  • sanitize content (Broken relative links #125 (comment), JSON feed content is not sanitized #227)
    • optional
  • unify feed data structure (so it looks the same regardless of feed format)
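
As a rough illustration of the early stages above (detect the high-level format, parse into a generic data structure, detect the feed format), here is a minimal sketch using only the standard library. The function names are made up for this example and are not part of reader or feedparser. Note that plain ElementTree is neither hardened against malicious XML (see defusedxml, #212) nor tolerant of malformed markup, so this only covers the happy path.

```python
import json
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"
ATOM_ROOT = "{http://www.w3.org/2005/Atom}feed"
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
JSON_FEED_VERSION_PREFIX = "https://jsonfeed.org/version/"


def detect_high_level_format(data: bytes) -> str:
    """Guess XML vs. JSON from the first significant byte (hypothetical helper)."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    head = data.lstrip()
    if head.startswith(b"<"):
        return "xml"
    if head.startswith(b"{"):
        return "json"
    raise ValueError("unknown high-level format")


def parse_generic(data: bytes):
    """Parse the stream into a generic structure: an Element or a dict."""
    kind = detect_high_level_format(data)
    if kind == "xml":
        # Not safe against hostile input; a real implementation
        # would likely want defusedxml or similar here.
        return kind, ET.fromstring(data)
    return kind, json.loads(data)


def detect_feed_format(kind: str, doc) -> str:
    """Tell RSS, Atom, and JSON Feed apart from the generic structure."""
    if kind == "json":
        version = doc.get("version", "") if isinstance(doc, dict) else ""
        if str(version).startswith(JSON_FEED_VERSION_PREFIX):
            return "json feed"
        raise ValueError("JSON document is not a JSON Feed")
    if doc.tag == ATOM_ROOT:
        return "atom"
    if doc.tag == "rss":
        return "rss"
    if doc.tag == RDF_ROOT:
        return "rss 1.0"
    raise ValueError(f"unknown XML feed root: {doc.tag}")
```

For example, `detect_feed_format(*parse_generic(data))` is meant to classify a feed from its content alone, without relying on the MIME type.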

Currently:

  • feedparser goes directly from stream to feed data structure by using xml.sax.
  • For JSON Feed, reader only relies on the inferred MIME type; feedparser does some sniffing to detect the feed format, and we rely on that (that is, reader has no logic to tell RSS apart from Atom etc.).
  • Both relative link resolution (which requires xml:base) and content sanitization can happen before or after storage. feedparser does them before storage, and I'm not sure we can use it to do them afterwards (the internals involved probably aren't stable, and they are tied into the SAX parsing logic).
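
To make the "after storage" option concrete: if the xml:base is stored alongside the entry, relative links can be resolved later with nothing more than `urllib.parse.urljoin`. The sketch below assumes a hypothetical entry represented as a plain dict with a `link` key. This is not how reader or feedparser actually model entries.

```python
from urllib.parse import urljoin


def resolve_entry_links(entry: dict, base: str, keys=("link",)) -> dict:
    """Return a copy of entry with the given URL fields resolved against base.

    `entry`, `base`, and `keys` are illustrative; `base` stands in for the
    xml:base value stored at parse time.
    """
    resolved = dict(entry)
    for key in keys:
        value = resolved.get(key)
        if value:
            # urljoin leaves absolute URLs untouched.
            resolved[key] = urljoin(base, value)
    return resolved


stored_entry = {"title": "Hello", "link": "posts/1.html"}
print(resolve_entry_links(stored_entry, "https://example.com/blog/feed.xml"))
```

Resolving links inside HTML content (as opposed to standalone URL fields) would additionally need an HTML parser, which is the part feedparser ties into its own parsing logic.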

@lemon24
Owner Author

lemon24 commented Nov 29, 2021

I've pretty much decided to continue using feedparser (#265 (comment)) and not switch to Atoma (#263), but it's worth documenting the factors that went into the decision.

I looked at feedparser 6.0.8 and Atoma 0.0.17.

| | feedparser | Atoma |
| --- | --- | --- |
| stable | yes | no (0.x) |
| maintainer responsiveness | low | high |
| format detection | yes | yes (tries to parse all formats) |
| JSON feed | no | yes |
| old feed formats | yes | no |
| Atom/RSS extensions | medium | high |
| file objects | yes | yes (no autodetection) |
| memory usage | high (reads feed in memory multiple times) | medium (builds whole etree) |
| typed | no | yes |
| safe XML | no | yes (defusedxml) |
| pluggable XML parser (defusedxml, lxml) | no (yes with global/monkeypatching) | no |
| bad encodings | yes | no |
| malformed feeds | yes | no |
| relative link resolution | yes (can be disabled, exposes XML base) | no |
| HTML sanitization | yes (can be disabled) | no |
| unified feed/entry interface | yes | no |

@lemon24
Owner Author

lemon24 commented Jan 29, 2022

Closing in favor of #265.

lemon24 closed this as completed Jan 29, 2022