Performance optimize and memory usage optimize #209
Comments
Another option is to set sanitize_html and resolve_relative_uris to False and then use lxml to process the feedparser result. That way I process only the fields I need, not all of them.
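A minimal sketch of that approach, under some assumptions: it relies on feedparser 6.x, where parse() accepts the sanitize_html and resolve_relative_uris keyword arguments, and it swaps lxml for the standard library's urljoin so the example stays self-contained. The helper names parse_without_postprocessing and resolve_needed_links are hypothetical, and the sketch only resolves links. It does not sanitize HTML, so any content rendered in a browser would still need separate cleaning.

```python
from urllib.parse import urljoin


def parse_without_postprocessing(source):
    """Parse a feed with sanitizing and relative-URI resolution turned off.

    Assumes feedparser 6.x, where parse() accepts these keyword arguments.
    """
    # Third-party package, imported lazily so the helper below works without it.
    import feedparser

    return feedparser.parse(
        source, sanitize_html=False, resolve_relative_uris=False
    )


def resolve_needed_links(entries, base_url, fields=("link",)):
    """Resolve relative URLs only in the fields the caller cares about.

    Returns new dicts and leaves the parsed entries untouched.
    """
    resolved = []
    for entry in entries:
        item = dict(entry)  # shallow copy of the dict-like entry
        for field in fields:
            value = item.get(field)
            if value:
                item[field] = urljoin(base_url, value)
        resolved.append(item)
    return resolved
```

With a parsed feed in hand, something like resolve_needed_links(result.entries, feed_url) would then touch only the link field of each entry instead of every field feedparser would normally rewrite.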
@guyskk I agree that there is a lot of room for improvement, and I'm really grateful for the analysis you've already done. I'd like to draw from your experience with RSSAnt, and improve feedparser so that it's meeting your needs better. I want to update this ticket so that it has a clear objective and can be driven to closure. I'm seeing several suggestions here, including:
Are there any other things that you can suggest besides these? After determining goals, let's establish priorities and possibly open additional tickets if needed. Thanks for investigating this and reporting the results! I look forward to working with you!
Thank you, Kurt! I have no other suggestions at the moment.
It would be really good to have this split up into smaller issues. With smaller, better-specified work packages, people (like me) could pick up work.
I only spent a short time investigating, but my profiler showed that a lot of time was spent in sgmllib, which is deprecated and hasn't seen updates since 2010. @kurtmckee do you think it could be beneficial to replace it?
Is there any plan to make progress on this issue? Has anyone experimented with lxml? Does it make parsing faster?
We're in the process of parsing about three million RSS feeds from federal courts, many of which are over 1 MB in size. We may take a look at making feedparser faster since it's currently our bottleneck in this project. @kurtmckee I assume such work is still welcome?
Yes, please! I'm currently working to migrate the test suite to pytest and make sure code coverage is getting checked. It's slow going, but I'm very open to performance improvements!
Why pytest? IMHO the only good thing about pytest is its command-line tool with nice colored output, which runs my unittest-style tests. The problem with pytest-style tests is that they are hard to understand because they hide too much and do too many things implicitly. This is always the "pro" argument at conferences and in blog posts: you need fewer lines of code to write your tests. To me that is not a pro but a con. I would argue that if someone thinks there are too many lines in a (unit) test, then the test is of low quality. But that is just my opinion as a less experienced, non-professional developer, so I am asking in order to learn.
There are a number of people subscribed to this thread, so -- without turning this into a pytest discussion thread! -- I'll summarize that:
These are some of the reasons I use pytest. I don't want this thread to become a discussion about pytest so I'll respond more in-depth to you privately.
I find that feedparser is very slow when parsing large feeds, and it is not fast when parsing small feeds either.
For example (about 5 MB):
https://aotu.io/atom.xml
feedparser takes 15 seconds to parse this feed, while another parser written in Go (gofeed) takes only 100 ms.
Another example (about 300 KB):
http://ohmymedia.com/feed/
feedparser takes 400 ms, while gofeed takes less than 10 ms.
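Timings like these can be reproduced with a small harness. The following is only a sketch: time_parser and make_sample_feed are hypothetical helpers, the synthetic feed stands in for the real documents above, and the standard library's ElementTree is used as the example parser so the code runs without extra packages.

```python
import time
import xml.etree.ElementTree as ET


def time_parser(parse_fn, data, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse_fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def make_sample_feed(n_entries):
    """Build a synthetic Atom-like document (hypothetical test data)."""
    entries = "".join(
        f"<entry><title>Post {i}</title><link href='/p/{i}'/></entry>"
        for i in range(n_entries)
    )
    feed = f"<feed xmlns='http://www.w3.org/2005/Atom'>{entries}</feed>"
    return feed.encode("utf-8")


if __name__ == "__main__":
    data = make_sample_feed(1000)
    elapsed = time_parser(ET.fromstring, data)
    print(f"ElementTree: {elapsed * 1000:.1f} ms")
    # If feedparser is installed, the same harness can time it:
    #     time_parser(feedparser.parse, data)
```

Taking the best of several runs reduces noise from caching and scheduling, which matters when comparing parsers whose times differ by an order of magnitude or more.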
So I think there is a lot of room for improvement. I did some analysis using pyinstrument, and it shows that resolve_relative_uris and _sanitize_html take most of the time. If we replace them with lxml or another C implementation, parsing should be much faster.
Memory usage is not very efficient either: string copy, encode, and decode operations use a lot of memory. I think there is room for improvement here too, but I haven't analyzed it in depth yet.
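For anyone who wants to repeat this kind of measurement, here is a rough sketch that uses only the standard library: cProfile stands in for pyinstrument, and tracemalloc reports peak Python-level allocations. The names profile_call, peak_memory, and decode_roundtrip are hypothetical, and decode_roundtrip is just a stand-in workload, not feedparser code.

```python
import cProfile
import io
import pstats
import tracemalloc


def profile_call(fn, *args, top=10):
    """Run fn under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def peak_memory(fn, *args):
    """Return (result, peak bytes allocated through Python while fn ran)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def decode_roundtrip(data):
    """Stand-in workload: decode, transform, and re-encode a byte string."""
    return data.decode("utf-8").upper().encode("utf-8")


if __name__ == "__main__":
    payload = b"<title>example</title>" * 10_000
    _, report = profile_call(decode_roundtrip, payload)
    print(report)
    _, peak = peak_memory(decode_roundtrip, payload)
    print(f"peak allocation: {peak / 1024:.0f} KiB")
```

To profile feedparser itself, pass feedparser.parse and the feed bytes in place of the stand-in workload. Note that tracemalloc only sees allocations made through Python's allocator, so memory used inside C extensions may not show up.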
I'm the author of RSSAnt, an RSS reader web app, and I use feedparser to parse feeds. Performance is critical for me, and I'd be glad to implement performance optimizations for feedparser.
Do you have any suggestions?