Plan for upcoming 3x memory improvement PR (and a few others) #296
Comments
Progress report: In revision 5 of the prototype, I extracted the "join a prefix to a file object" logic into a separate class (with a few tests); this is useful for a few reasons. I'm still thinking of how to split the change into a few refactoring commits and a few feature commits, so it's easier to review.
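For illustration, a minimal sketch of what such a class might look like (the name and details are mine, not necessarily the prototype's): it replays an already-read prefix before delegating to the underlying file, so the bytes used for encoding detection don't require a seekable stream or a second read.

```python
import io


class PrefixFileWrapper:
    """Read-only file-like object that yields `prefix` first,
    then the remaining content of `file` (illustrative sketch only)."""

    def __init__(self, prefix, file):
        self.prefix = prefix
        self.file = file
        self.offset = 0  # how much of the prefix was already returned

    def read(self, size=-1):
        chunk = self.prefix[self.offset:]
        if size >= 0:
            chunk = chunk[:size]
        self.offset += len(chunk)
        if size < 0:
            return chunk + self.file.read()
        if len(chunk) < size:
            chunk += self.file.read(size - len(chunk))
        return chunk


file = io.BytesIO(b"<feed>...</feed>")
prefix = file.read(6)                     # bytes used for encoding detection
wrapped = PrefixFileWrapper(prefix, file)
assert wrapped.read() == b"<feed>...</feed>"
```

With something like this, parse() can read just the first part of the feed to detect the encoding, then hand the parser a stream that behaves as if nothing had been read yet.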
@lemon24 great initiative! These leaks are hurting us too. How far along are you? Is there something I can help with, or do you have a workaround for these issues?
@tadeoos, parse() in the gist from my previous comment works correctly for any seekable file (that is, to parse a URL you have to download the feed yourself and pass the open file to it). The next step is to integrate the stuff from the gist into feedparser itself (instead of wrapping some internals), without breaking anything, and without making the code significantly harder to maintain going forward. (Sadly, I haven't had the time/disposition to continue this. It's likely I'll pick it up again in the following month, but I'm not promising anything.)
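For anyone who wants to try the gist in the meantime, usage would look roughly like this (the gist's parse() signature is an assumption on my part; downloading to a temporary file gives a seekable stream without holding the whole feed in memory):

```python
import shutil
import tempfile
import urllib.request

# Hypothetical usage of the gist's parse(), not feedparser.parse().
with tempfile.TemporaryFile() as file:
    with urllib.request.urlopen("https://example.com/feed.xml") as response:
        shutil.copyfileobj(response, file)  # stream to disk, not into memory
    file.seek(0)
    result = parse(file)
```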
Thanks for your work on PR #302!
@kurtmckee, I want to continue working on the other two issues when I have some time. Do you agree with the solutions proposed in my first comment? The first one ("defusedxml or lxml can't be used with feedparser") requires a (backwards compatible) API change. I'm looking for your input because this implies that feedparser will continue to use a SAX parser indefinitely (but IMHO, the existence of PREFERRED_XML_PARSERS already implies that).
My apologies, I overlooked the additional items! I'd like to discuss these items in more detail, but the discussion re: SAX parsers and fallbacks has been tearing at me for a long time. I want feedparser to get to a point where it's not trying to use an XML parser and then falling back to a loose parser. To that end, I'm currently writing a handler that uses […].

I think at that point the SAX parsing could be completely ripped out, and feedparser could drop its code that supports re-parsing the same document twice in two different modes. I anticipate that this lays the groundwork for feedparser to begin parsing h-feeds.

Do you think that migrating from XML/sgml parsers to HTML/lxml parsers could address the current lack of support for defusedxml?
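To make the single-parser idea concrete (lxml here is purely an illustration; whether feedparser would actually adopt it is the open question above), a recovering parser accepts both well-formed and broken markup in one pass, which is what would let the strict-then-loose two-pass approach go away:

```python
from lxml import etree

# One recovering parser handles both documents below in a single pass,
# instead of parsing strictly and re-parsing loosely on failure.
parser = etree.XMLParser(recover=True)

well_formed = b"<feed><title>ok</title></feed>"
broken = b"<feed><title>missing closing tag</feed>"

for data in (well_formed, broken):
    root = etree.fromstring(data, parser=parser)
    print(root.findtext("title"))
```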
Overall, I am +1 on not using an XML parser (if it's not an overwhelming amount of work to do). From what I understand, most issues with XML come from it being able to do too much (file/network access etc.), and libraries assuming it comes from a trusted source; a feed is pretty much in the opposite situation. This will likely simplify things for decoding as well:
Regarding the need for defusedxml:
So, given your plans, I think it's more than reasonable to not move forward with defusedxml / custom SAX parser support (feel free to close this).
While writing the comment above, I realized there may be a number of opportunities presented by a reworking of the parsing backend. I understand you're operating under a number of constraints: time, backwards compatibility, and, I assume, pure-Python-only dependencies – at least out of the box. Even if none of the following are included, I still think it's useful to list them out, in case they influence your design.
From what I understand, feedparser is really old, and it needed to do a lot of work for things that aren't strictly related to feeds because there was no better option available at the time (and the packaging story wasn't ideal either). Since then, the library ecosystem has matured quite a bit, and (carefully) farming out some of that work to others might lower the maintenance burden significantly.

I am not suggesting you do any of the above, just bringing them up so they may have an influence on whatever happens next. If you find one of them particularly interesting, I can look into it further when I have some time. Once again, thank you for all your work on feedparser!
Hi @kurtmckee,
I'm using feedparser heavily in reader, a feed reader library.
There are 3 issues I'd like to submit PRs for; I've opened this issue to provide a bit of background.
If time allows, please let me know if you see any issues with the proposed approaches. (I will be working on the PRs regardless.)
Thank you, appreciate your work!
Proposed changes follow.
3x memory usage reduction
I managed to reduce memory usage of parse() by ~3x (prototype) and make it 4% faster, by not reading the entire feed into memory, and decoding it on the fly (a rough sketch of the idea follows at the end of this section). I tested it with 150 feeds I use day to day, and the results were identical with those of feedparser 6.0.8.
tl;dr of the change: in parse():
For backwards compatibility, the new behavior would be controlled by a new argument, parse(..., optimistic_encoding_detection=True), with False keeping the current read-everything-into-memory behavior; not having .optimistic_encoding_detection at all would make the code simpler, though. To increase the chances of correctly detecting the encoding, we can make the feed fragment used reasonably large (e.g. 100K). Do you think this is an acceptable tradeoff?

Proposed refactoring (not required, but it would make parse() easier to understand/change going forward): add a new function, parse_file(file, ...), that only takes a stream; parse() would call parse_file() after getting one from _open_resource().

Resolves: #209, maybe #287.
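To make the overall shape concrete, here is a self-contained sketch (not the prototype, and not feedparser's API; TitleHandler and parse_stream are names I made up): read a bounded prefix, which is where encoding detection would happen, then feed the rest of the stream to an incremental SAX parser chunk by chunk, so the whole feed is never held in memory at once. The prototype also decodes on the fly; here expat is left to deal with the raw bytes to keep things short.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._current = None

    def startElement(self, name, attrs):
        if name == "title":
            self._current = []

    def characters(self, content):
        if self._current is not None:
            self._current.append(content)

    def endElement(self, name):
        if name == "title" and self._current is not None:
            self.titles.append("".join(self._current))
            self._current = None


def parse_stream(file, prefix_size=100 * 1024, chunk_size=2 ** 16):
    # The prefix is where encoding detection (BOM, XML declaration,
    # HTTP headers) would happen in the real thing.
    prefix = file.read(prefix_size)

    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Feed the parser incrementally instead of reading everything first.
    parser.feed(prefix)
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()

    return handler.titles


feed = b'<?xml version="1.0"?><feed><title>example</title></feed>'
print(parse_stream(io.BytesIO(feed)))  # ['example']
```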
Make Atom summary/content not depend on order
#59 already has a pending pull request for this.
I'm not sure if it was not merged because it was never a priority, or because more testing is required; if the latter, I can do the needed work, so please let me know.
A sharper solution is to side-step the summary/content detection logic for Atom only. My understanding is that the reliance on order was mainly because RSS doesn't really have separate summary and content elements. Atom does, so when we get a summary tag, we can just use it for summary. Prototype (tested with 150 feeds, of which ~55 were Atom).
Resolves: #59.
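For illustration, a toy standalone example of the order-independent mapping (plain ElementTree, not feedparser's actual handler code): for Atom, summary always maps to summary and content to content, regardless of which element comes first.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entry_summary_and_content(entry):
    """Map Atom elements to fields by tag name, not by document order."""
    summary = entry.findtext(f"{ATOM}summary")
    content = entry.findtext(f"{ATOM}content")
    return summary, content


# <summary> comes after <content>: the order doesn't matter.
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><content>full text</content><summary>short text</summary></entry>"
    "</feed>"
)
print(entry_summary_and_content(feed.find(f"{ATOM}entry")))
# ('short text', 'full text')
```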
defusedxml or lxml can't be used with feedparser
This can kinda be done by appending to feedparser.api.PREFERRED_XML_PARSERS, but it has a number of issues:
For reference, here's what using PREFERRED_XML_PARSERS with feedparser looks like now: defusedxml, lxml.
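The defusedxml variant of that workaround looks roughly like this (assuming feedparser still passes feedparser.api.PREFERRED_XML_PARSERS to xml.sax.make_parser(), which imports each listed module and calls its create_parser(); defusedxml.expatreader provides one). This is the approach with the issues mentioned above:

```python
import feedparser
import feedparser.api

# Ask xml.sax.make_parser() to try defusedxml's expat reader before the
# standard library one; this relies on feedparser internals staying as-is.
feedparser.api.PREFERRED_XML_PARSERS.append("defusedxml.expatreader")

result = feedparser.parse("https://example.com/feed.xml")
```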
My solution to this would be to add a new argument, parse(..., sax_parser=None); for backwards compatibility, it'd default to the current behavior.

Resolves: #107 (XML parsing is not safe by default, but it is possible to make it safe).
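Usage of the proposed argument might look something like this (hypothetical, since the argument doesn't exist yet, and whether it should take a parser instance, a factory, or a module name is still open):

```python
import defusedxml.sax
import feedparser

# Hypothetical: tell parse() which SAX parser factory to use;
# sax_parser=None would keep the current behavior.
result = feedparser.parse(
    "https://example.com/feed.xml",
    sax_parser=defusedxml.sax.make_parser,
)
```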