
Performance optimize and memory usage optimize #209

Open · guyskk opened this issue Mar 28, 2020 · 10 comments

guyskk commented Mar 28, 2020

I find that feedparser is very slow when parsing large feeds, and it's not fast when parsing small feeds either.

For example (about 5 MB): https://aotu.io/atom.xml
feedparser takes 15 seconds to parse the feed, while another parser written in Go (gofeed) takes only 100 ms.

Another example (about 300 KB): http://ohmymedia.com/feed/
feedparser takes 400 ms while gofeed takes less than 10 ms.

So I think there's a lot of room for improvement. I did some analysis using pyinstrument, and it shows that resolve_relative_uris and _sanitize_html cost most of the time. If we replaced them with lxml or another C implementation, it would be very fast.

[Screenshot: pyinstrument profile output]
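For reference, a minimal sketch of how this kind of profiling can be reproduced (not the exact script used above; the file name is just an example of a locally saved copy of a large feed):

```python
import feedparser
from pyinstrument import Profiler

# Any saved copy of a large feed works here, e.g. the ~5 MB aotu.io Atom feed.
with open("atom.xml", "rb") as f:
    data = f.read()

profiler = Profiler()
profiler.start()
feedparser.parse(data)
profiler.stop()

# Prints a call tree; the dominant frames show where the time goes
# (resolve_relative_uris and _sanitize_html in the profile above).
print(profiler.output_text(unicode=True, color=True))
```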

Memory usage is not very efficient either: string copies and encode/decode operations cost a lot of memory. I think there is some room for improvement there too, but I haven't analyzed it in depth yet.

I'm the author of RSSAnt, an RSS reader web app that uses feedparser to parse feeds. Performance is very critical for me, and I'd be glad to implement performance optimizations for feedparser.

Do you have any suggestions?

guyskk commented Mar 28, 2020

Setting sanitize_html and resolve_relative_uris to False, and then using lxml to process the feedparser result, is another option; that way I only process the fields I need, not all of them.
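A minimal sketch of that workaround, assuming a feedparser release whose parse() accepts these keyword arguments (feedparser 6.x) and an lxml version that still ships lxml.html.clean (newer lxml releases moved it into the separate lxml_html_clean package):

```python
import feedparser
from lxml.html.clean import Cleaner

# Skip feedparser's pure-Python sanitizer and relative-URI resolver.
result = feedparser.parse(
    "http://ohmymedia.com/feed/",
    sanitize_html=False,
    resolve_relative_uris=False,
)

# Then sanitize only the fields we actually display, using lxml's C-backed cleaner.
cleaner = Cleaner(scripts=True, javascript=True, style=True)
for entry in result.entries:
    summary = entry.get("summary")
    if summary:
        entry["summary"] = cleaner.clean_html(summary)
```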

kurtmckee (Owner) commented:

@guyskk I agree that there is a lot of room for improvement, and I'm really grateful for the analysis you've already done. I'd like to draw on your experience with RSSAnt and improve feedparser so that it meets your needs better.

I want to update this ticket so that it has a clear objective and can be driven to closure. I'm seeing several suggestions here, including:

  • Migrate resolve_relative_uris to C
  • Migrate _sanitize_html to C code (note that I would like to strip this custom code and use an external package; a rough sketch follows at the end of this comment)
  • Decrease encode/decode calls
  • Decrease string copies

Are there any other things that you can suggest besides these? After determining goals, let's establish priorities and possibly open additional tickets if needed.

Thanks for investigating this and reporting the results! I look forward to working with you!
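To make the second bullet above concrete, here is a rough sketch of what delegating sanitization to an external package could look like, using bleach purely as an example (the allow-lists below are illustrative, not feedparser's actual whitelist):

```python
import bleach

# Illustrative allow-lists only; feedparser's real whitelist is larger.
ALLOWED_TAGS = {"a", "b", "blockquote", "code", "em", "i", "img", "li", "ol", "p", "pre", "strong", "ul"}
ALLOWED_ATTRIBUTES = {"a": ["href", "title"], "img": ["src", "alt", "title"]}

def sanitize_html(html: str) -> str:
    # bleach.clean() strips or escapes anything outside the allow-lists.
    return bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, strip=True)

print(sanitize_html('<p onclick="evil()">hello</p>'))
# -> '<p>hello</p>'  (the disallowed onclick attribute is dropped)
```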

guyskk commented Sep 1, 2020

Thank you, Kurt! I have no other suggestions in mind at the moment.
I'd also be glad to build an RSS dataset so that we can benchmark/profile against it.

thigg commented Dec 29, 2020

It would be really good to have this split up into smaller issues. With smaller, better-specified work packages, people (like me) could pick up the work.

thigg commented Mar 9, 2021

I only spent a short time investigating, but my profiler showed that a lot of time was spent in sgmllib, which is deprecated and hasn't seen updates since 2010. @kurtmckee do you think it could be beneficial to replace it?
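A quick way to confirm where the time goes is to restrict a cProfile run to the suspect modules (a sketch; the feed file name is just an example):

```python
import cProfile
import pstats

import feedparser

with open("feed.xml", "rb") as f:
    data = f.read()

profiler = cProfile.Profile()
profiler.enable()
feedparser.parse(data)
profiler.disable()

# Show only frames matching these patterns, sorted by cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("sgmllib|sanitize|resolve_relative", 20)
```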

thigg commented Oct 18, 2022

Is there any plan to make progress on this issue? Has anyone experimented with lxml? Does it make parsing faster?

mlissner (Contributor) commented:

We're in the process of parsing about three million RSS feeds from federal courts, many of which are over 1MB in size.

We may take a look at making feedparser faster since it's currently our bottleneck in this project. @kurtmckee I assume such work is still welcome?

kurtmckee (Owner) commented:

Yes, please!

I'm currently working to migrate the test suite to pytest and make sure code coverage is getting checked. It's slow going, but I'm very open to performance improvements!

buhtz commented Mar 13, 2023

Why pytest?
I'm shocked, but your opinion has value for me, so I'm asking in order to learn. :)

IMHO the only good thing about pytest is its command-line tool with nice colored output that runs my unittest-style tests. The problem with pytest-style tests is that they are hard to understand, because they hide too much and don't make enough explicit. This is always the "pro" argument at conferences and in blog posts: you need fewer lines of code to write your tests. To me that isn't a pro but a con: I would argue that if someone thinks there are too many lines in a (unit) test, then the test is of low quality.

But this is just my opinion as a less experienced, non-professional developer, so I'm asking in order to learn.
I tell myself: if Kurt is migrating to pytest, there must be very good reasons. I just don't see them. 😄

kurtmckee (Owner) commented:

There are a number of people subscribed to this thread, so -- without turning this into a pytest discussion thread! -- I'll summarize that:

  • it's possible to run tests in parallel
  • it's possible to randomize module and test execution order, which can shake out unexpected interdependencies in the test suite
  • it's possible to parametrize a test, which moves for-loops from inside the test function to outside it, and reports results for every tested value, not just the first one that fails (see the sketch below)
  • pytest's dependency injection model makes it easy to clearly state "this test requires such-and-such mock or setup" (such as mocked HTTP responses)

These are some of the reasons I use pytest. I don't want this thread to become a discussion about pytest so I'll respond more in-depth to you privately.
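For the parametrization point above, a small hypothetical example (not taken from feedparser's actual test suite):

```python
import feedparser
import pytest

# Each date string becomes its own reported test case, so every failing
# value is visible, not just the first one.
@pytest.mark.parametrize(
    "date_string",
    [
        "Sat, 28 Mar 2020 12:00:00 GMT",
        "Tue, 18 Oct 2022 12:00:00 +0000",
    ],
)
def test_rss_pubdate_is_parsed(date_string):
    feed = (
        '<rss version="2.0"><channel><item>'
        f"<pubDate>{date_string}</pubDate>"
        "</item></channel></rss>"
    )
    result = feedparser.parse(feed)
    assert result.entries[0].published_parsed is not None
```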
