[Bug]: Table locks / slow queries on 0.19.6 betas #4983
Comments
What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run? Could just be due to the general DB overload though.
One more thing, after a 30min+ downtime the instance will get hammered by incoming federation queries trying to catch the instance up to the current state of the network. Since our incoming federation is not limited, it will be implicitly only limited by resource limits (CPU+DB), which might also appear as general load and cause everything to slow down. So if you're looking at perf issues, make sure that you only start measuring once the federation state is up to date for all incoming instances.
I turned this on ~20 minutes after the migrations finished and after startup, once I saw things were going slow. I let it run for maybe 5-10 minutes.
I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation docker-container, and it was still slow, so it's probably not federation related. This is gonna be a tough one to solve, and we probably need to look at changes to the DB and the post query list function since
That container only handles outgoing federation, the incoming federation which phiresky mentioned goes through the main container which handles http requests.
Yeah, it's a bit harder to check incoming federation state, which is the important part here. Outgoing federation will be idle after downtime. A query like the one sketched below could help:
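A rough sketch of such a check; the table and column names are assumptions based on Lemmy 0.19's schema, where each processed incoming activity is recorded with its published timestamp:

```sql
-- If the newest received activity is recent, incoming federation has caught up.
-- Assumes Lemmy 0.19's received_activity table.
SELECT max(published)         AS last_received,
       now() - max(published) AS lag
FROM received_activity;
```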
Is this issue still relevant? We've been running 0.19.6-beta1 on lemmy.ml for a while now, and I haven't noticed any problems. This is also the only issue remaining in the 0.19.6 milestone, so once it's closed we can publish the new release.
Yes, beta1 doesn't have any of the DB changes, only that one specific federation commit. So we still have to investigate which commit is causing the slowness.
The changes to post_view.rs are really trivial so that can't be the problem.
And for triggers there's only a single change, which also looks very simple. As for migrations: image_details is really the only major change between these versions; everything else is minor bug fixes or dependency upgrades.
I wouldn't necessarily assume that a code change is the cause, it might just be lemmy in general handling recovery from downtime poorly (as in, if incoming federation gets hammered recovering from downtime, it might cause compounding slowness everywhere). To test, you could shut down the instance for 30min or however long it was down before and just start it again on the same version; I would tentatively expect the same extra load. One reason why I'm saying this is that people have been complaining about lemmy becoming "slower" after every upgrade for multiple releases, and often it seems to just be temporary for the next few hours after the upgrade.
I'm willing to try it again, as long as @Nutomic and someone else is available to help me test. I don't think it's federation, because I tried turning off federation, and it still wasn't usable. But when I say the site was unusable, I mean that it was inaccessible to apps, and the web ui would only intermittently work. 78702b5 (the apub_url trigger changes) is the only one that sticks out to me as something that could've gone wrong.
One problem with 0.19.6 is the migration from #3205. It takes a long time to recalculate all controversial scores. Once Lemmy starts again, postgres runs auto vacuum on the post_aggregates table (probably to regenerate the index). This is quite slow as it also needs to handle api requests at the same time. So maybe we should run vacuum as part of the migration, so it can use the full server cpu. And it would be good if the migration could filter some rows, e.g. posts with one or zero votes. cc @dullbananas
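A sketch of what that filtered migration could look like; the controversy formula and column names are assumptions based on #3205, and VACUUM cannot run inside a transaction block, so it would have to run outside the usual Diesel migration transaction:

```sql
-- Sketch only: recompute the controversy score just for posts that can have a
-- non-zero value, i.e. posts with at least one upvote AND one downvote.
UPDATE post_aggregates
SET controversy_rank = (upvotes + downvotes)
    * CASE WHEN upvotes > downvotes
           THEN downvotes::float / upvotes
           ELSE upvotes::float / downvotes
      END
WHERE upvotes > 0 AND downvotes > 0;

-- Reclaim space and refresh statistics right away, instead of leaving it to
-- autovacuum while the server is already busy serving API requests.
VACUUM ANALYZE post_aggregates;
```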
But the main problem is that db queries are still slowing down extremely, so the site becomes unusable within a few minutes after startup. The slow queries are all annotated as
Some stats:
Diff to look at: 0.19.5...0.19.6-beta.9-no-triggers
I'm still not convinced it's related to any actual change rather than just a combination of the migration rewriting a table, destroying the in-memory page cache, and then the downtime causing the server to get hammered with federation requests. Maybe just skip the controversial update? It's mostly eventually consistent anyways without the migration, no?
@phiresky Incoming federation would result in create and update queries, but the stats show only select queries at the top. Besides, if lemmy.ml is down for half an hour then I believe it would take at least half an hour more for other instances to send activities again. But what we saw was no server load on startup, quickly ramping up to 100% server load within a minute. We also would have seen similar problems during previous upgrades, but those were fine. Anyway, if there are no better ideas we can try to make a beta without any migrations so it's easy to revert. If that fails we need to bisect to find the problematic commit, or apply the commits with migrations one by one to see which one causes problems.
We've narrowed it down now: the problem is caused by 33fd317. The only thing it does as a db migration is increase the max length of post.url from 512 to 2000. From what I can find this field is only queried in post_view, both via search and to get crossposts via GET /api/v3/post. My suspicion is that we are using the wrong index type for post.url (currently btree). Another possibility is that with the longer max length, postgres compresses some items or stores them indirectly. To prevent this we could change the column's storage setting.
Edit: see #5148
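For reference, the migration and the two mitigations discussed here could look roughly like the following; the exact table, column, and index names are assumptions, not the actual commit contents:

```sql
-- The migration in 33fd317, roughly: only the declared max length changes.
ALTER TABLE post ALTER COLUMN url TYPE varchar(2000);

-- Check which index type currently covers post.url:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'post' AND indexdef ILIKE '%url%';

-- Option 1: a hash index for the pure equality lookups used for crossposts:
CREATE INDEX CONCURRENTLY idx_post_url_hash ON post USING hash (url);

-- Option 2: rule out TOAST compression / out-of-line storage for the column:
ALTER TABLE post ALTER COLUMN url SET STORAGE EXTERNAL; -- or PLAIN to also force inline storage
```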
That's weird. I don't see anything suspicious in that commit... How did you narrow it down? As far as I know, the declared max length of strings doesn't affect how they are stored, only the length check. This is (kind of) confirmed in 8.3 Character Types:
So if this does change perf characteristics, that should only be for newly inserted rows (that previously couldn't have been inserted)
That can't really be the issue either: btree is the default index type in postgresql and is recommended for equality comparisons just fine. They used to not recommend hash indexes at all.
Sorry, I missed your above comment.
I don't think so - the initial back-off was a maximum of doubling downtime, but we've since changed the exponential backoff function to increase by 25% on every attempt, which I think means incoming federation should start back up 0-5 min after the restart (randomly different for each instance). See the simulation here: #4597 (comment) (Table
True, if the others had similar downtime.
We had to make various beta releases to test all the different commits, see branches release/v0.19-no-migrations and release/v0.19-no-migrations-dess. beta.11 was still fine, but beta.12 had the slowdown problem. beta.13 is also fine, so that only leaves the commit with the url max length change. In terms of Rust code it only changes some constants and tests, so the problem can only be the migration. We should be able to reproduce the problem simply by running the alter table command directly, but then the only way to fix lemmy.ml performance is to restore a full sql backup (table vacuum didn't help). If it's not the index and not the storage method, what else could be the problem?
Mh, sounds like pretty clean testing. I'd still be very surprised if the actual issue is the varchar(x) change though. One last-ditch suggestion: it could be the missing statistics. Maybe the query plans postgresql generates after the migration are really bad because it does not know the distribution of the url column anymore. You can try out running an ANALYZE on the table:
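A minimal version of that suggestion, assuming the affected table and column are post.url:

```sql
-- Regenerate the planner statistics that the ALTER COLUMN ... TYPE change reset:
ANALYZE post;
-- or, limited to the changed column:
ANALYZE post (url);
```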
You mean because even changing the type back does not fix it? Did you test that?
Yep, I tried changing the type back to varchar(512).
If it still stays slow, then really the only thing I think it can be is missing stats.
I'll try that.
* Run analyze after changing post.url type (ref #4983)
* rename back

Co-authored-by: Dessalines <[email protected]>
Fixed by the above commits. See #5148 for context.
Summary
Earlier tonight I tried to deploy 0.19.6-beta.6 to lemmy.ml, after having tested various versions of it on voyager.lemmy.ml for a few weeks. Post queries start stalling out pretty quickly, and it becomes unusable. I didn't think we changed anything major with the post queries, so this could be trigger related, or something to do with the site and community aggregates causing locks.
Also, the controversial migration does take ~30m and locks things up, but I suppose that's unavoidable, and not too big a deal since it's only run once.
I turned on pg_stat_statements and got this:
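For reference, roughly how such stats can be collected; this assumes PostgreSQL 13+, where the pg_stat_statements timing columns are named mean_exec_time / total_exec_time:

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
-- and a server restart before the extension can be created.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by cumulative execution time:
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 120)                   AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```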
For now I restored lemmy.ml from the backup I took before.
cc @dullbananas @phiresky @Nutomic
Steps to Reproduce
NA
Technical Details
NA
Version
0.19.6
Lemmy Instance URL
voyager.lemmy.ml