[Bug]: Table locks / slow queries on 0.19.6 betas #4983

Closed · dessalines opened this issue Aug 20, 2024 · 24 comments
Labels: area: database, bug (Something isn't working)
Milestone: 0.19.6

@dessalines (Member)

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Earlier tonight I tried to deploy 0.19.6-beta.6 to lemmy.ml, after having tested various versions of it on voyager.lemmy.ml for a few weeks. Post queries start stalling out pretty quickly, and it becomes unusable.

I didn't think we changed anything major with the post queries, so this could be trigger related, or something to do with the site and community aggregates causing locks.

Also, the controversial-rank migration does take ~30m and locks things up, but I suppose that's unavoidable, and not too big a deal since it's only run once.

I turned on pg_stat_statements and got this:

[Screenshot: pg_stat_statements output]

For now I restored lemmy.ml from the backup I took before.

cc @dullbananas @phiresky @Nutomic

Steps to Reproduce

NA

Technical Details

NA

Version

0.19.6

Lemmy Instance URL

voyager.lemmy.ml

@dessalines added the bug label Aug 20, 2024
@dessalines added this to the 0.19.6 milestone Aug 20, 2024
@phiresky (Collaborator) commented Aug 20, 2024

what's the time frame the screenshot data is from? purely after the upgrade? how long did it run?
The second query, with 161s total time and 15k calls, is the get_instance_followed_community_inboxes function. Depending on the time frame that's a lot of calls, and the time is much higher than it should be on average (most of the calls should be WHERE published > now() - interval '1 minute', which is indexed, returns an empty list, and should take <1ms).

Could just be due to the general DB overload though.
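For reference, a quick way to sanity-check that the recent-activity lookup is hitting its index would be something like this (a sketch only; community_follower and its published column are assumptions about what the function reads, not its actual body):

```sql
-- Sketch: check that the "published in the last minute" filter is served
-- by an index and returns (usually) zero rows almost instantly.
-- Table and column names are illustrative, not the real function body.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM community_follower
WHERE published > now() - interval '1 minute';
```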

@phiresky (Collaborator) commented Aug 20, 2024

One more thing: after a 30min+ downtime, the instance will get hammered by incoming federation requests trying to catch the instance up to the current state of the network. Since our incoming federation is not rate limited, it will implicitly be limited only by resource limits (CPU + DB), which might also appear as general load and cause everything to slow down.

So if you're looking at perf issues, make sure that you only start measuring once the federation state is up to date for all incoming instances.

@dessalines (Member, Author)

what's the time frame the screenshot data is from? purely after the upgrade? how long did it run?

I turned this on ~20 minutes after the migrations finished and the server started up, once I saw things were going slow. I let it run for maybe 5-10 minutes.

One more thing: after a 30min+ downtime, the instance will get hammered by incoming federation requests trying to catch the instance up to the current state of the network.

I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation docker container, and it was still slow, so it's probably not federation related.

This is gonna be a tough one to solve, and we probably need to look at changes to the DB and the post list query since 0.19.5. It's no rush though; we should probably wait till @Nutomic gets back next month anyway for the next release.

@Nutomic (Member) commented Aug 21, 2024

I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation docker container, and it was still slow, so it's probably not federation related.

That container only handles outgoing federation; the incoming federation which phiresky mentioned goes through the main container, which handles HTTP requests.

@phiresky (Collaborator) commented Aug 21, 2024

Yeah, it's a bit harder to check incoming federation state which is the important part here. Outgoing federation will be idle after downtime.

A query like the following might work:

select * from federation_queue_state
join instance on instance_id = instance.id
where not instance.dead and last_successful_id > 1e7
order by last_successful_published_time asc

The 1e7 filter is to ignore small instances, which are unlikely to overload the instance, and last_successful_published_time shows when the last processed activity was published. They should all be close to the current time.

@Nutomic (Member) commented Sep 13, 2024

Is this issue still relevant? We've been running 0.19.6-beta1 on lemmy.ml for a while now, and I haven't noticed any problems. This is also the only issue remaining in the 0.19.6 milestone, so once it's closed we can publish the new release.

@dessalines (Member, Author)

Yes, beta1 doesn't have any of the DB changes, only that one specific federation commit. So we still have to investigate which commit is causing the slowness.

@Nutomic (Member) commented Sep 24, 2024

The changes to post_view.rs are really trivial, so that can't be the problem:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_views/src/post_view.rs
53a226b Add show_nsfw override filter to GetPosts. (#4889)
32cee9c Fixing not being able to create comments on local community posts. (#4854)
d09854a Adding a show_read override to GetPosts. (#4846)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

And for triggers there's only a single change, which also looks very simple:
git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_schema/replaceable_schema/triggers.sql
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)

And migrations:
git log --oneline --follow 0.19.5..0.19.6-beta.6 -- migrations/
33fd317 Adding a URL max length lemmy error. (#4960)
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)
fd58b4f Exponential controversy rank (#4872)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

image_details is really the only major change between these versions; everything else is minor bug fixes or dependency upgrades.

@phiresky (Collaborator)

So we still have to investigate which commit is causing the slowness.

I wouldn't necessarily assume that a code change is the cause; it might just be Lemmy in general handling recovery from downtime poorly (as in, if incoming federation gets hammered recovering from downtime, it might cause compounding slowness everywhere).

To test, you could shut down the instance for 30 min (or however long it was down before) and just start it again on the same version; I would tentatively expect the same extra load.

One reason why I'm saying this is that people have been complaining about Lemmy becoming "slower" after every upgrade for multiple releases, and often it seems to just be temporary slowness in the first few hours after the upgrade.

@dessalines (Member, Author) commented Oct 1, 2024

I'm willing to try it again, as long as @Nutomic and someone else are available to help me test. I don't think it's federation, because I tried turning off federation, and it still wasn't usable.

But when I say the site was unusable, I mean that it was inaccessible to apps, and the web ui would only intermittently work.

78702b5 (the apub_url trigger changes) is the only one that sticks out to me as something that could've gone wrong.

@Nutomic (Member) commented Oct 18, 2024

One problem with 0.19.6 is the migration from #3205. It takes a long time to recalculate all the controversy ranks. Once Lemmy starts again, Postgres runs autovacuum on the post_aggregates table (probably to regenerate the index). This is quite slow, as it also needs to handle API requests at the same time. So maybe we should run vacuum as part of the migration, so it can use the full server CPU. It would also be good if the migration could filter some rows, e.g. posts with one or zero votes.
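The vacuum part could look roughly like this (a sketch only, not the actual migration; VACUUM can't run inside a transaction, so it would need a separate non-transactional step), and the recalculation's UPDATE could carry a filter along the lines of WHERE upvotes + downvotes > 1 to skip posts with one or zero votes:

```sql
-- Sketch only: reclaim dead tuples and refresh planner statistics for
-- post_aggregates during the migration window, instead of leaving that
-- work to autovacuum while the API is already serving requests.
-- Must run outside a transaction block.
VACUUM (ANALYZE) post_aggregates;
```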

cc @dullbananas

@Nutomic (Member) commented Oct 18, 2024

But the main problem is that DB queries are still slowing down extremely, so the site becomes unusable within a few minutes after startup. The slow queries are all annotated as PostQuery::list, so the most likely cause is in that file. We already reverted #4797 and removed the join to image_details which was added in #4704; neither made any difference. Shutting down the federation container didn't help either, so the problem is definitely caused by the API. But the remaining changes in post_view.rs look really trivial, so I don't know what else may be causing problems.

@dessalines (Member, Author) commented Oct 18, 2024

@phiresky (Collaborator)

I'm still not convinced it's related to any actual change, rather than just a combination of the migration rewriting a table, destroying the in-memory page cache, and then the downtime causing the server to get hammered with federation requests.

Maybe just skip the controversial update? It's mostly eventually consistent anyway without the migration, no?

@Nutomic (Member) commented Oct 18, 2024

@phiresky Incoming federation would result in create and update queries, but the stats show only select queries at the top. Besides, if lemmy.ml is down for half an hour, then I believe it would take at least another half hour for other instances to send activities again. But what we saw was no server load on startup, quickly ramping up to 100% server load within a minute. We also would have seen similar problems during previous upgrades, but those were fine.

Anyway, if there are no better ideas we can try to make a beta without any migrations so it's easy to revert. If that fails we need to bisect to find the problematic commit, or else apply the commits with migrations one by one to see which one causes problems.

@Nutomic (Member) commented Oct 30, 2024

We've narrowed it down now: the problem is caused by 33fd317. The only thing its DB migration does is increase the max length of post.url from 512 to 2000. From what I can find, this field is only queried in post_view, both via search and to get crossposts via GET /api/v3/post.

My suspicion is that we are using the wrong index type for post.url (currently "idx_post_url" btree (url), which is meant for ordering and pattern matching). But we only query for exact url matches, so the index should be changed to hash.

Another possibility is that with the longer max length, Postgres compresses some items or stores them out of line. To prevent this we can set the column's storage to plain.
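Concretely, the two candidate fixes would be something like this (a sketch, not a tested migration; the index name is the existing idx_post_url):

```sql
-- Sketch of the two ideas above, not a tested migration.

-- 1) Replace the btree index with a hash index, since post.url is only
--    ever compared for equality.
DROP INDEX idx_post_url;
CREATE INDEX idx_post_url ON post USING hash (url);

-- 2) Keep url values inline and uncompressed, so longer URLs are never
--    compressed or moved out of line (TOASTed).
ALTER TABLE post ALTER COLUMN url SET STORAGE PLAIN;
```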

Edit: see #5148

@phiresky (Collaborator) commented Oct 30, 2024

That's weird. I don't see anything suspicious in that commit... How did you narrow it down?

As far as I know, the max length of strings (varchar(x) vs text) is purely a validation difference, not a storage one. Whether TOAST tables or compression are used is decided per row, depending on the individual string length.

This is (kind of) confirmed in 8.3 Character Types:

For many purposes, character varying acts as though it were a domain over text.
There is no performance difference among these three types [...]
Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.

So if this does change perf characteristics, that should only be for newly inserted rows (that previously couldn't have been inserted)

so the index should be changed to hash.

That can't really be the issue either: btree is the default index type in PostgreSQL and is perfectly fine for equality comparisons. They used to not recommend hash indexes at all, but they've since improved them. Still, btree is fine, just a bit larger than a hash index. Also, this is not something that has changed, so why would it be the issue?
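If anyone wants to double-check what is actually in place on lemmy.ml, something like this should show it (a sketch, assuming the table and index names above):

```sql
-- Current storage strategy for post.url
-- ('p' = plain, 'x' = extended, 'e' = external, 'm' = main).
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'post'::regclass AND attname = 'url';

-- Definition of the existing index on the column.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'post' AND indexname = 'idx_post_url';
```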

@phiresky (Collaborator) commented Oct 30, 2024

Sorry, I missed your above comment.

if lemmy.ml is down for half an hour, then I believe it would take at least another half hour for other instances to send activities again.

I don't think so. The initial backoff was at most a doubling of the downtime, but we've since changed the exponential backoff function to increase by 25% on every attempt, which I think means incoming federation should start back up within 0-5 minutes of the restart (randomly different for each instance). See the simulation here: #4597 (comment) (the 1.25**retries table).

We also would have seen similar problems during previous upgrades, but those were fine.

True if the others had similar downtime

@Nutomic (Member) commented Oct 30, 2024

We had to make various beta releases to test all the different commits; see the branches release/v0.19-no-migrations and release/v0.19-no-migrations-dess. beta.11 was still fine, but beta.12 had the slowdown problem. beta.13 is also fine, so that only leaves the commit with the url max length change. In terms of Rust code it only changes some constants and tests, so the problem can only be the migration.

We should be able to reproduce the problem simply by running the ALTER TABLE command directly, but then the only way to fix lemmy.ml performance is to restore a full SQL backup (a table vacuum didn't help). If it's not the index and not the storage method, what else could be the problem?
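For reference, the reproduction would essentially boil down to running the equivalent of the migration's DDL by hand (the exact statement may differ from the real migration):

```sql
-- Roughly what the 33fd317 migration does: widen the length limit on
-- post.url from 512 to 2000 characters (exact DDL may differ).
ALTER TABLE post ALTER COLUMN url TYPE varchar(2000);
```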

@phiresky (Collaborator) commented Oct 30, 2024

Mh, sounds like pretty clean testing. I'd still be very surprised if the actual issue is the varchar(x) change though.

One last-ditch suggestion: it could be missing statistics. Maybe the query plans PostgreSQL generates after the migration are really bad because it no longer knows the distribution of the url column.


You can try running ANALYZE post after the migration. VACUUM does not by itself run ANALYZE, so maybe that's what you were missing?
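For example (a sketch; just the standard ANALYZE, scoped to the post table):

```sql
-- The column's statistics are typically dropped when its type changes,
-- which can leave the planner guessing badly. Check whether stats exist:
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'post' AND attname = 'url';

-- Rebuild them; this is cheap compared to a full vacuum.
ANALYZE post;
```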

but then the only way to fix lemmy.ml performance is to restore a full sql backup

You mean because even changing the type back does not fix it? Did you test that?

@dessalines (Member, Author)

Yep, I tried changing the type back to varchar(512), and it still stayed slow. Doing a full table vacuum didn't help either.

@phiresky (Collaborator)

If it still stays slow, then really the only thing I think it can be is missing stats: analyze post.

@dessalines (Member, Author)

I'll try analyze post today in a lemmy.ml prod DB on my local machine, and see if it makes a difference.

Nutomic added a commit that referenced this issue Nov 4, 2024
* Run analyze after changing post.url type (ref #4983)

* rename back

---------

Co-authored-by: Dessalines <[email protected]>
@dessalines (Member, Author)

Fixed by the above commits. See #5148 for context.
