This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unbounded increases in GC time #7176

Closed
tulir opened this issue Mar 30, 2020 · 23 comments
Labels
A-Performance: Performance, both client-facing and admin-facing
Z-Upstream-Bug: This issue requires a fix in an upstream dependency.

Comments

@tulir (Member) commented Mar 30, 2020

Description

After updating to 1.12.0, average event send times have gone up. Metrics show GC time increasing much faster than before, by around 1 second per day. Memory usage hasn't changed significantly, though. @lub:imninja.net's metrics showed the same problem when I asked in #synapse:matrix.org.

The problem did not happen with the 1.12 release candidate, which leads me to believe it could have something to do with the Twisted update.

[graphs]

Version information

@lub (Contributor) commented Mar 31, 2020

[graph]

I don't see such a clear difference in other metrics, though; maybe I just have less load.

[graph]

I use matrixdotorg/synapse:latest. I also ran rc1 on release day, but it doesn't seem to have caused that behaviour.

@clokep added the A-Performance label Apr 2, 2020
@tulir (Member Author) commented Apr 8, 2020

I tried downgrading to the previous Twisted version, but that didn't seem to help, so I guess it wasn't that after all.

New graphs

[graphs]

For comparison, here are 30 day graphs from before 1.12. The GC time still went up and didn't come down, but the growth was much slower:

Graphs from before 1.12

[graphs]

@richvdh (Member) commented Apr 8, 2020

It's hard to see why GC time would increase when memory usage and the number of objects being collected stay the same.

The only change in 1.12.0 which is vaguely related to GC is #6953. Could you patch out that change and see if it helps at all?

@casimir commented Apr 8, 2020

After something like a week of uptime, synapse had managed to use all the RAM on my server. After a restart (a little after 12:00), everything was smooth again. I can provide more precise information if needed.

[graphs]

I am the only user on my server and I am in a few big rooms. Also, I only started sending messages today, in #synapse:matrix.org.

@tulir (Member Author) commented Apr 9, 2020

I think I reverted it in https://mau.dev/maunium/synapse, but it still seems to grow quickly:

[graph]

@richvdh (Member) commented Apr 9, 2020

@tulir yeah; it was a long shot tbh. I have no real idea what could cause this. Are you sure it didn't happen with 1.12.0rc1?

It feels like something has changed outside of synapse. I don't suppose you've had a python version change or something?

I guess it could just reflect a different traffic pattern from your clients/federated servers.

@twoequaldots commented:

I am seeing similar issues. At first I thought it might be related to open file handles and/or use_presence: true (now false) in our configuration, but those changes only seemed to help a bit. The only way around this is periodically restarting matrix-synapse.

I was starting to suspect that the version of Python we are running may not play well with the Twisted update (we are on Debian 9 with Python 3.5.3 and matrix-synapse-py3 1.12.3).

Unfortunately, we weren't scraping Prometheus metrics until we got onto the 1.12.x branch so we cannot compare GC collection time from a prior release.

@deepbluev7 (Contributor) commented:

So, my GC times are still growing linearly:

[graph]

It seems mostly the federation reader and federation sender are affected. My caches are already full and are not growing anymore:

[graph]

But it seems like memory usage for the federation workers is still climbing:

[graph]

This may suggest that there is a resource leak involving DNS or connections or something similar. It could of course be something else, but I would expect the memory usage of the federation sender to flatten off after a few days. In any case, the GC times shouldn't grow linearly over time, which they do at the moment.
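
(As a rough illustration, and not something from this thread: one hypothetical way to check a worker for a file-descriptor or connection leak is with the third-party psutil package, pointed at the worker's PID. The PID below is a placeholder.)

import psutil

worker_pid = 12345  # hypothetical: the federation sender's process id
proc = psutil.Process(worker_pid)

print("open file descriptors:", proc.num_fds())         # POSIX only
print("open connections:     ", len(proc.connections()))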

@richvdh (Member) commented Sep 7, 2020

As I've said previously, I don't think this is a regression: it's more likely that there was some change in the federation which coincided with the upgrade and made synapse want to chew more memory.

We're not seeing any sort of leaking on the servers we administer, so I guess the issues being seen here are either due to misconfigured caches, or due to some strange interaction of environment/configuration.

It is certainly true that synapse can be a memory hog, and it's a bit of a black-box in terms of what it is doing. Making the caching more intelligent is on the medium-term roadmap, and as part of that work I'd like to add some analytics which reveal how much memory is being used in caches (and how much elsewhere).

However, in the shorter term, it's hard for us to do much to investigate individual cases of unexpectedly high memory usage. The only thing I can really suggest is that folks have a play with python memory-profiling tools to see if you can find out where the memory is being used.
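
(As a rough sketch of that sort of poking around, and not a synapse-specific recipe: the third-party objgraph package, run from a REPL or manhole attached to the running process, can show which object types dominate and which are growing.)

import gc
import objgraph

gc.collect()
print(len(gc.get_objects()))          # total objects tracked by the GC
objgraph.show_most_common_types(10)   # which types dominate the heap
objgraph.show_growth(limit=10)        # call again later to see what is growing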

@deepbluev7 (Contributor) commented:

@richvdh How often are you restarting your federation workers? In my case I can simply plot a straight line through the increasing GC times and memory usage of my federation workers. The cache size reaches its maximum value after a few hours, while the GC increases happen over a period of a few days. A week is pretty much the limit, since at some point my server simply runs out of memory. I can work around that by restarting my synapse/workers every few days. Of the people I've asked, about half see the same pattern. Some don't, but many of them don't even have metrics enabled. So while this may not affect everyone, it does affect at least a few people with very different configurations.

I also wouldn't jump to this being cache related. As I said, my caches don't grow after a certain size, so this could be any kind of resource leak, or an unbounded cache. I'll see if I can figure something out with a memory profiler at some point. I'm not sure if this is a regression or has only become noticeable recently, but an unbounded increase in memory use and GC times does sound like a bug to me.

@richvdh changed the title from "Memory/GC time regression" to "Unbounded increases in GC time" Sep 7, 2020
@richvdh (Member) commented Sep 7, 2020

@richvdh How often are you restarting your federation workers?

We have no problems running them for up to a couple of weeks. There's no particular change in GC time over this period.

If it's limited to federation senders, it could be #7828, which has a fix in the forthcoming v1.20.0 release.

@deepbluev7 (Contributor) commented:

If it's limited to federation senders, it could be #7828, which has a fix in the forthcoming v1.20.0 release.

Yeah, I think I linked to a similar issue before. We'll see; the graphs don't match up exactly, but both are growing, that is true.

If you can run your workers for a few weeks, then it really seems to be system specific. Usually synapse updates tend to restart my workers often enough that memory use doesn't become an issue, so I thought that may mask the effects.

@deepbluev7 (Contributor) commented:

I'm fairly certain now that #7828 is unrelated, after running the 1.20 RCs for a while (unless the fix didn't actually land yet):

[graph]

I haven't had time to profile yet, though.

@richvdh (Member) commented Sep 25, 2020

I think I'm actually seeing something similar to this on my personal HS:

[graph]

I'll have a poke about to see if I can figure out what's going on.

@richvdh (Member) commented Sep 26, 2020

I've found some interesting things:

Firstly, as time goes by, there is an accumulation of objects being tracked by the garbage collector, as returned by len(gc.get_objects()):

[graph: len(gc.get_objects()) over time]

This is consistent with GC taking more and more time, as the GC must inspect each of those objects on each gen2 garbage collection.
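
(As a side note, a quick way to see that cost directly, a sketch rather than part of the original investigation, is to time a forced generation-2 collection from the same REPL:)

import gc
import time

start = time.perf_counter()
collected = gc.collect(2)   # force a full (generation-2) collection
elapsed = time.perf_counter() - start
print("gen2 collect took %.3fs; %d objects collected; %d objects still tracked"
      % (elapsed, collected, len(gc.get_objects())))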

The majority of these objects are tuples:

>>> objgraph.show_most_common_types(5,gc.get_objects())
tuple       879076
dict        73675
set         62514
list        60363
ProfileInfo 43990

And more interestingly still, objgraph fails to find any reference to most of those tuples:

>>> unreffed = objgraph.get_leaking_objects()
>>> sum(1 for t in unreffed if isinstance(t, tuple))
844957

Of those leaking tuples, they are almost all 1- or 2-tuples:

>>> collections.Counter(len(t) for t in unreffed if isinstance(t, tuple))
Counter({1: 531264, 2: 313618, 3: 30, 4: 26, 5: 9, 7: 3, 6: 2, 8: 1, 9: 1, 11: 1, 13: 1, 15: 1})

the 1-tuples almost all contain a zope.interface.declarations.Implements:

>>> onetups = [t for t in unreffed if isinstance(t, tuple) and len(t) == 1]
>>> implonetups = [t for t in onetups if isinstance(t[0], zope.interface.declarations.Implements)]
>>> len(implonetups)
531032

and the 2-tuples are almost all (zope.interface.interface.InterfaceClass, zope.interface.declarations.Implements) pairs:

>>> twotups = [t for t in unreffed if isinstance(t, tuple) and len(t) == 2]
>>> impltwotups = [
...     t
...     for t in twotups
...     if isinstance(t[0], zope.interface.interface.InterfaceClass)
...     and isinstance(t[1], zope.interface.declarations.Implements)
... ]
>>> len(impltwotups)
313596

So, to recap: of the 1.4M objects being tracked by the garbage collector in my synapse process, 845,000 (60%) are tuples containing zope.interface.declarations.Implements objects which appear to be unreferenced by any other python object. This suggests some sort of leak in the C parts of zope.interface, though it's also possible that we or Twisted are misusing zope in some way that causes this.

@richvdh (Member) commented Sep 27, 2020

It's maybe a little early to say, but disabling zope's C extensions (by moving _zope_interface_coptimizations.cpython-35m-x86_64-linux-gnu.so aside) seems to stop the leak, or at least significantly change the behaviour:

About 15 hours after restart:

>>> len(gc.get_objects())
540302
>>> unreffed = objgraph.get_leaking_objects()
>>> objgraph.show_most_common_types(5,unreffed)
dict   1696
list   949
CType  103
tuple  95
method 46

Next step is to see if I can repro this behaviour outside of synapse, I think.
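
(For what it's worth, here is a minimal sketch of the kind of standalone loop one might try for that. It is entirely hypothetical: it assumes the leak is triggered by repeated providedBy checks and interface adaptation, which is the sort of thing Twisted does constantly, and it may well not reproduce the problem.)

import gc
from zope.interface import Interface, implementer

class IFoo(Interface):
    pass

@implementer(IFoo)
class Foo(object):
    pass

def tracked():
    gc.collect()
    return len(gc.get_objects())

before = tracked()
for _ in range(100000):
    obj = Foo()
    IFoo.providedBy(obj)   # interface check
    IFoo(obj, None)        # adaptation via InterfaceClass.__call__
after = tracked()
print("tracked objects: %d -> %d (delta %d)" % (before, after, after - before))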

@tulir (Member Author) commented Sep 27, 2020

Downgrading to zope.interface<5 also seems to solve it. The release date of v5 matches the start of this issue pretty much exactly.

[graph]

@richvdh (Member) commented Sep 27, 2020

@tulir aha, seems we have a culprit then. Will continue investigating and raise an issue with zope once I have more details.

@richvdh (Member) commented Sep 27, 2020

@clokep added the Z-Upstream-Bug label Sep 29, 2020
@clokep (Member) commented Sep 30, 2020

Looks like zope released a 5.1.1 with the fix. Do we want to do anything about this? (Blacklist versions or anything?) Or is this "fixed" from our POV?

@richvdh (Member) commented Sep 30, 2020

I'm just running with updated zope to confirm. Assuming it works I don't think we need to go as far as blacklisting versions; the most we might want to do is suggest in a changelog or somewhere that people upgrade.

@richvdh (Member) commented Oct 1, 2020

Yup looks good to me. Thanks to @jamadden for jumping on this from the zope side.

@richvdh closed this as completed Oct 1, 2020
@jamadden commented Oct 1, 2020

Thanks @richvdh and everyone here for finding and reporting this! Apologies for the trouble it caused.
