Trio run loop generates cyclic garbage, which can cause more frequent GC stalls on CPython #1770
observation: the "lost time" seems to be relatively constant in each of the import cases when looking at a larger time scale:
|
You might note that the timers trio waits on have limited resolution. Thus, depending on how much work you do (or the kernel thinks you're doing), sooner or later the math of rounding those timers causes Trio to trigger a 1-ms scheduling step at some point. Whatever the kernel does with that is up to the kernel; its scheduling decisions do depend on how often your process has stalled in the past, which means that if it had more work to do at startup (like loading numba) it'll get treated differently. If you need high-resolution timers, you'll need a different mechanism. Data point: on my system the max delta is 5 ms no matter what I tweak. |
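For what it's worth, here is a minimal sketch (mine, not from the thread) of how one might measure that "max delta" on a given system: request a short sleep many times and record the worst overshoot.

```python
import trio

async def main():
    worst = 0.0
    for _ in range(1000):
        start = trio.current_time()
        await trio.sleep(0.001)  # ask for 1 ms
        worst = max(worst, trio.current_time() - start - 0.001)
    print(f"max delta over requested sleep: {worst * 1000:.1f} ms")

trio.run(main)
```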
He says the time is disappearing while tasks are stepping, though, not when trio is blocking for I/O. One obvious thing to rule out would be GC pauses. You'll want to check if calling gc.disable() makes the stalls go away. |
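A sketch (assumptions mine, not the exact check suggested above) of two quick ways to rule GC in or out: disable it entirely, or hook gc.callbacks to log how long each collection pauses the process.

```python
import gc
import time

_start = None

def _gc_timer(phase, info):
    # called by the interpreter at the start and end of each collection
    global _start
    if phase == "start":
        _start = time.perf_counter()
    elif phase == "stop" and _start is not None:
        pause_ms = (time.perf_counter() - _start) * 1000
        if pause_ms >= 1.0:
            print(f"gc gen {info['generation']} pause: {pause_ms:.1f} ms")

gc.callbacks.append(_gc_timer)

# ...or the blunt version: call gc.disable() at startup and see whether the
# stalls disappear (reference cycles will then accumulate until process exit).
```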
Huh. I totally missed that. In my test, gc.disable() makes the stalls go away. |
Ah, I even have a theory about why numba makes it worse.
If the full-collection GC is in fact the problem, then I guess there are a number of ways you could try tweaking it.
You can use the gc module's tuning functions. |
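For instance (a sketch of the usual gc-module knobs; which specific call the comment had in mind isn't preserved in this thread):

```python
import gc

gc.disable()        # opt out of automatic cycle collection entirely
gc.freeze()         # Python 3.7+: move everything currently alive (e.g. modules
                    # imported at startup) into a permanent generation that the
                    # collector never scans
gc.set_threshold(70_000, 10, 10)  # or just make collections much rarer by
                                  # raising the generation-0 threshold
```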
thank you -- no idea how I made it this far being unaware of the Python GC

so it makes sense that this is easier to observe at high CPU use -- otherwise the full collection cycle might be scheduled while the process is idle

still a mystery: why I couldn't reproduce it on asyncio or time.sleep() |
@belm0 The GC only considers doing a full collection after 700 * 10 * 10 = 70,000 objects have been allocated without being deallocated. Presumably your asyncio and time.sleep() loops don't leave enough un-deallocated objects behind to hit that threshold very often. |
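Those numbers correspond to the interpreter's default generation thresholds, which you can check directly (a quick sketch):

```python
import gc

print(gc.get_threshold())  # default CPython build: (700, 10, 10)
print(gc.get_count())      # current allocation counts per generation
```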
Hmm, yeah, running your script with |
I've measured a fairly consistent 35 garbage objects per sleep:

```python
import gc
import trio

gc.set_debug(gc.DEBUG_SAVEALL)

T = 5
N = 3000

async def _run():
    garbage_start = len(gc.garbage)
    for _ in range(N):
        await trio.sleep(T/N)
    garbage_delta = len(gc.garbage) - garbage_start
    print(f'garbage rate: {garbage_delta / T:.1f}')
    print(f'garbage per sleep: {garbage_delta / N:.1f}')

trio.run(_run)
```
|
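Back-of-envelope arithmetic (mine, not from the thread): 35 garbage objects per sleep at 3000 sleeps over 5 s is roughly 21,000 cyclic-garbage objects per second, so the ~70,000-object budget mentioned above is used up every few seconds even before counting the application's own allocations, which fits regularly recurring full collections.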
```python
# replacing the for loop in the script above:
for i in range(N):
    if i == 10:
        gc.collect()
        gc.set_debug(gc.DEBUG_SAVEALL)
        assert not gc.garbage
    await trio.sleep(T/N)
    if i == 10:
        gc.collect()
        print('\n'.join(repr(x) for x in gc.garbage))
        gc.set_debug(0)
```

here are the 36 garbage items generated by one sleep(0) call:
|
right, so the next step is to poke around in that object graph and figure
out where these objects are coming from and where the cycles are.
…On Wed, Oct 28, 2020 at 4:58 AM John Belmonte ***@***.***> wrote:
here are the 36 garbage items generated by one sleep(0) call:
Error(Cancelled())
Cancelled()
<traceback object at 0x10dda84b0>
<traceback object at 0x10dda8460>
<frame at 0x10dda5b90, file '/.../site-packages/trio/_timeouts.py', line 56, code sleep_until>
<function sleep_forever.<locals>.<lambda> at 0x10dda3680>
<traceback object at 0x10dda8410>
<frame at 0x10dda4c20, file '/.../site-packages/trio/_timeouts.py', line 40, code sleep_forever>
<traceback object at 0x10dda8370>
<frame at 0x10dda4dd0, file '/.../site-packages/trio/_core/_traps.py', line 166, code wait_task_rescheduled>
<traceback object at 0x10dda8320>
<frame at 0x10ddb0050, file '/.../site-packages/outcome/_sync.py', line 111, code unwrap>
<frame at 0x10dda26d0, file '/.../site-packages/trio/_core/_run.py', line 1167, code raise_cancel>
<frame at 0x10dda7250, file '/.../site-packages/outcome/_async.py', line 23, code capture>
<frame at 0x10dda6960, file '/.../site-packages/trio/_core/_run.py', line 1158, code _attempt_abort>
<function Task._attempt_delivery_of_any_pending_cancel.<locals>.raise_cancel at 0x10dda37a0>
<frame at 0x10dd9e910, file '/.../site-packages/trio/_core/_run.py', line 1169, code _attempt_delivery_of_any_pending_cancel>
<frame at 0x10dda6790, file '/.../site-packages/trio/_core/_run.py', line 358, code recalculate>
<frame at 0x10ddaa050, file '/.../site-packages/trio/_core/_run.py', line 681, code cancel>
CancelStatus(_scope=<trio.CancelScope at 0x10dda0590, exited, cancelled>, effectively_cancelled=True)
[]
<frame at 0x10dda65c0, file '/.../site-packages/trio/_core/_ki.py', line 159, code wrapper>
<trio.CancelScope at 0x10dda0590, exited, cancelled>
set()
set()
<frame at 0x10dda5d70, file '/.../site-packages/trio/_core/_run.py', line 197, code expire>
{'args': (<trio.CancelScope at 0x10dda0590, exited, cancelled>,), 'kwargs': {}, 'enabled': True, 'fn': <function CancelScope.cancel at 0x10db35710>, ***@***.***_KI_PROTECTION_ENABLED': True}
(<trio.CancelScope at 0x10dda0590, exited, cancelled>,)
<bound method CancelScope._exc_filter of <trio.CancelScope at 0x10dd3f750, exited, cancelled>>
<cell at 0x10cef3210: function object at 0x10dd7d950>
<cell at 0x10cef3150: method object at 0x10db29500>
<cell at 0x10cef3250: function object at 0x10dda3710>
(<cell at 0x10cef3210: function object at 0x10dd7d950>, <cell at 0x10cef3150: method object at 0x10db29500>)
<function _filter_impl.<locals>.filter_tree at 0x10dd7d950>
(<cell at 0x10cef3250: function object at 0x10dda3710>,)
<function _filter_impl.<locals>.push_tb_down at 0x10dda3710>
--
Nathaniel J. Smith -- https://vorpus.org
|
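A sketch of one way to do that poking, building on the DEBUG_SAVEALL dump above (the helper name is mine): for each saved garbage object, list which other garbage objects refer to it, which makes the cycle edges visible.

```python
import gc

def dump_garbage_edges():
    garbage = list(gc.garbage)
    garbage_ids = {id(obj) for obj in garbage}
    for obj in garbage:
        # only show referrers that are themselves part of the garbage set
        referrers = [r for r in gc.get_referrers(obj) if id(r) in garbage_ids]
        print(f"{type(obj).__name__} at {hex(id(obj))} <- "
              f"{[type(r).__name__ for r in referrers]}")
```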
This avoids invoking the cycle collector as often; see python-trio/trio#1770 for more details.
When put together, this change + python-trio/outcome#29 should fix python-trio#1770
copying new garbage cases from the PR discussion: these are with the existing Trio and outcome fixes applied

cancelled background task: 40 items

```python
async with trio.open_nursery() as nursery:
    nursery.start_soon(trio.sleep_forever)
    await trio.sleep(T/N)  # a little less garbage if this is removed
    nursery.cancel_scope.cancel()
```

cancelled empty nursery: 25 items

```python
async with trio.open_nursery() as nursery:
    nursery.cancel_scope.cancel()
```
|
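For reproducing those counts, a self-contained harness might look like this (a sketch; `case_cancelled_background_task` is my name, and the count can differ by a few items because the surrounding run loop produces some garbage of its own):

```python
import gc
import trio

T, N = 5, 3000  # same constants as the earlier script

async def case_cancelled_background_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep_forever)
        await trio.sleep(T / N)
        nursery.cancel_scope.cancel()

async def main():
    gc.collect()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep whatever the collector finds
    before = len(gc.garbage)
    await case_cancelled_background_task()
    gc.collect()
    print(f"garbage items: {len(gc.garbage) - before}")
    gc.set_debug(0)

trio.run(main)
```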
symptom
at a regular period (e.g. 10s or 20s), there are significant stalls (e.g. 30 ms) where it appears the main thread is pre-empted
In terms of trio.abc.Instrument, the lost time appears between before_task_step and after_task_step, though tasks are doing nothing with threads or time.sleep(). In a single-threaded app with many heterogeneous tasks, the stall events appear to be distributed randomly among the tasks.

I've come up with a minimal program to reproduce the effect. Note that it is CPU-intensive (a loop of very small trio.sleep()), as CPU use > 70% seems necessary to induce the problem.

Note that the repro imports numba, and that is the spooky part. With the program as written, observed output (timestamp, stall in ms):

However, removing the numba import, the stall durations are significantly reduced (6 ms) and at a different period (3 ms):

And here is importing numpy instead of numba (10 ms stalls, 4 ms period):

conditions
seen in Trio 0.17.0
it's observed by multiple people running the repro, across multiple architectures (at least Linux/amd64 and OS X)
So far it's specific to Trio. I've tried equivalent loops using asyncio and time.sleep(), which do not exhibit the problem.

original bug filed to numba: numba/numba#6347
theories
next steps
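For reference, a minimal sketch (mine; the threshold and output format are assumptions, not the reporter's code) of the kind of trio.abc.Instrument used to observe the symptom above: time each task step and report any step that exceeds a threshold.

```python
import time
import trio

class StallInstrument(trio.abc.Instrument):
    def __init__(self, threshold_ms=10.0):
        self.threshold_ms = threshold_ms
        self._t0 = None

    def before_task_step(self, task):
        self._t0 = time.perf_counter()

    def after_task_step(self, task):
        if self._t0 is not None:
            elapsed_ms = (time.perf_counter() - self._t0) * 1000
            if elapsed_ms > self.threshold_ms:
                print(f"{time.time():.3f} stall of {elapsed_ms:.1f} ms in {task.name}")

async def main():
    # CPU-intensive loop of very small sleeps, as in the repro description
    for _ in range(50_000):
        await trio.sleep(0.0001)

trio.run(main, instruments=[StallInstrument()])
```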