
Performance optimize and memory usage optimize #209

Open · guyskk opened this issue Mar 28, 2020 · 10 comments

guyskk commented Mar 28, 2020

I find that feedparser is very slow when parsing large feeds, and it's not fast when parsing small feeds either.

For example (about 5 MB): https://aotu.io/atom.xml
feedparser takes 15 seconds to parse the feed, while another parser written in Go (gofeed) takes only 100 ms.

Another example (about 300 KB): http://ohmymedia.com/feed/
feedparser takes 400 ms while gofeed takes less than 10 ms.

So I think there's a lot of room for improvement. I did some analysis using pyinstrument, and it shows that resolve_relative_uris and _sanitize_html cost most of the time. If we replaced them with lxml or another C implementation, it would be very fast.

[Screenshot: pyinstrument profile output]
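For reference, a minimal sketch of how this kind of profiling can be reproduced (not the exact script used above; the file name is just an example of a locally saved copy of a large feed):

```python
import feedparser
from pyinstrument import Profiler

# Any saved copy of a large feed works here, e.g. the ~5 MB aotu.io Atom feed.
with open("atom.xml", "rb") as f:
    data = f.read()

profiler = Profiler()
profiler.start()
feedparser.parse(data)
profiler.stop()

# Prints a call tree; the dominant frames show where the time goes
# (resolve_relative_uris and _sanitize_html in the profile above).
print(profiler.output_text(unicode=True, color=True))
```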

Memory usage is not very efficient either: string copies and encode/decode operations cost a lot of memory. I think there is some room for improvement there too, but I haven't analyzed it in depth yet.

I'm the author of RSSAnt, an RSS reader web app that uses feedparser to parse feeds. Performance is very critical for me, and I'd be glad to implement performance optimizations for feedparser.

Do you have any suggestions?

guyskk commented Mar 28, 2020

Setting sanitize_html and resolve_relative_uris to False, and then using lxml to process the feedparser result, is another option; that way I only process the fields I need, not all of them.
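A minimal sketch of that workaround, assuming a feedparser release whose parse() accepts these keyword arguments (feedparser 6.x) and an lxml version that still ships lxml.html.clean (newer lxml releases moved it into the separate lxml_html_clean package):

```python
import feedparser
from lxml.html.clean import Cleaner

# Skip feedparser's pure-Python sanitizer and relative-URI resolver.
result = feedparser.parse(
    "http://ohmymedia.com/feed/",
    sanitize_html=False,
    resolve_relative_uris=False,
)

# Then sanitize only the fields we actually display, using lxml's C-backed cleaner.
cleaner = Cleaner(scripts=True, javascript=True, style=True)
for entry in result.entries:
    summary = entry.get("summary")
    if summary:
        entry["summary"] = cleaner.clean_html(summary)
```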

kurtmckee (Owner) commented:

@guyskk I agree that there is a lot of room for improvement, and I'm really grateful for the analysis you've already done. I'd like to draw on your experience with RSSAnt and improve feedparser so that it meets your needs better.

I want to update this ticket so that it has a clear objective and can be driven to closure. I'm seeing several suggestions here, including:

  • Migrate resolve_relative_uris to C
  • Migrate _sanitize_html to C code (note that I would like to strip this custom code and use an external package; a rough sketch follows at the end of this comment)
  • Decrease encode/decode calls
  • Decrease string copies

Are there any other things that you can suggest besides these? After determining goals, let's establish priorities and possibly open additional tickets if needed.

Thanks for investigating this and reporting the results! I look forward to working with you!
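To make the second bullet above concrete, here is a rough sketch of what delegating sanitization to an external package could look like, using bleach purely as an example (the allow-lists below are illustrative, not feedparser's actual whitelist):

```python
import bleach

# Illustrative allow-lists only; feedparser's real whitelist is larger.
ALLOWED_TAGS = {"a", "b", "blockquote", "code", "em", "i", "img", "li", "ol", "p", "pre", "strong", "ul"}
ALLOWED_ATTRIBUTES = {"a": ["href", "title"], "img": ["src", "alt", "title"]}

def sanitize_html(html: str) -> str:
    # bleach.clean() strips or escapes anything outside the allow-lists.
    return bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, strip=True)

print(sanitize_html('<p onclick="evil()">hello</p>'))
# -> '<p>hello</p>'  (the disallowed onclick attribute is dropped)
```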

guyskk commented Sep 1, 2020

Thank you, Kurt! I have no other suggestions in mind at the moment.
I'd also be glad to build an RSS dataset so that we can benchmark/profile against it.

thigg commented Dec 29, 2020

It would be really good to have this split up into smaller issues. With smaller, better-specified work packages, people (like me) could pick up the work.

thigg commented Mar 9, 2021

I only spent a short time investigating, but my profiler showed that a lot of time was spent in sgmllib, which is deprecated and hasn't seen updates since 2010. @kurtmckee do you think it could be beneficial to replace it?
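A quick way to confirm where the time goes is to restrict a cProfile run to the suspect modules (a sketch; the feed file name is just an example):

```python
import cProfile
import pstats

import feedparser

with open("feed.xml", "rb") as f:
    data = f.read()

profiler = cProfile.Profile()
profiler.enable()
feedparser.parse(data)
profiler.disable()

# Show only frames matching these patterns, sorted by cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("sgmllib|sanitize|resolve_relative", 20)
```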

thigg commented Oct 18, 2022

Is there any plan to make progress on this issue? Has anyone experimented with lxml? Does it make parsing faster?

mlissner (Contributor) commented:

We're in the process of parsing about three million RSS feeds from federal courts, many of which are over 1MB in size.

We may take a look at making feedparser faster since it's currently our bottleneck in this project. @kurtmckee I assume such work is still welcome?

kurtmckee (Owner) commented:

Yes, please!

I'm currently working to migrate the test suite to pytest and make sure code coverage is getting checked. It's slow going, but I'm very open to performance improvements!

buhtz commented Mar 13, 2023

Why pytest?
I'm shocked, but your opinion has value for me, so I'm asking in order to learn. :)

IMHO the only good thing about pytest is its command-line tool with nice colored output that runs my unittest-style tests. The problem with pytest-style tests is that they are hard to understand, because they hide too much and don't make enough explicit. This is always the "pro" argument at conferences and in blog posts: you need fewer lines of code to write your tests. To me that isn't a pro but a con: I would argue that if someone thinks there are too many lines in a (unit) test, then the test is of low quality.

But this is just my opinion as a less experienced, non-professional developer, so I'm asking in order to learn.
I tell myself: if Kurt is migrating to pytest, there must be very good reasons. I just don't see them. 😄

kurtmckee (Owner) commented:

There are a number of people subscribed to this thread, so -- without turning this into a pytest discussion thread! -- I'll summarize that:

  • it's possible to run tests in parallel
  • it's possible to randomize module and test execution order, which can shake out unexpected interdependencies in the test suite
  • it's possible to parametrize a test, which moves for-loops from inside the test function to outside it, and reports results for every tested value, not just the first one that fails (see the sketch below)
  • pytest's dependency injection model makes it easy to clearly state "this test requires such-and-such mock or setup" (such as mocked HTTP responses)

These are some of the reasons I use pytest. I don't want this thread to become a discussion about pytest so I'll respond more in-depth to you privately.
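For the parametrization point above, a small hypothetical example (not taken from feedparser's actual test suite):

```python
import feedparser
import pytest

# Each date string becomes its own reported test case, so every failing
# value is visible, not just the first one.
@pytest.mark.parametrize(
    "date_string",
    [
        "Sat, 28 Mar 2020 12:00:00 GMT",
        "Tue, 18 Oct 2022 12:00:00 +0000",
    ],
)
def test_rss_pubdate_is_parsed(date_string):
    feed = (
        '<rss version="2.0"><channel><item>'
        f"<pubDate>{date_string}</pubDate>"
        "</item></channel></rss>"
    )
    result = feedparser.parse(feed)
    assert result.entries[0].published_parsed is not None
```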
