Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait longer for messages to arrive in the queue #86

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mdkent
Copy link

@mdkent mdkent commented Jul 30, 2021

Naive fix for an issue we're seeing here in trying to rbtrace some puma workers where it seems bin/rbtrace is triggering the signal_handler_wrapper a bit too quickly, and it runs through all the msgrcv() retries before the message actually arrives from bin/rbtrace. This results in:

a@b:/u/apps/c/current$ bundle exec rbtrace -p 7160 -e "1+1"
Error: argument --pid (pid is not listening for messages, did you `require "rbtrace"`).
Try --help for help.

Here's an strace of the bin/rbtrace side:

16:42:11.312872 msgsnd(393220, {1, "\222\246attach\315\237\254\<snip>

and an strace of the puma worker:

[pid 21689] 16:42:11.288543 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.288637 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.288719 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.288806 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.288893 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.288974 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.289057 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.289138 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.289225 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 21689] 16:42:11.289315 msgrcv(393220, 0x7fba024fd410, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)

Note the timestamp, the reads all happen before the message queue is written to.

Here's the puma worker side with this new wait, we get 3 msgrcv retries with the 100ms sleep:

[pid 16328] 05:53:22.789390 msgget(0xffffc053, IPC_CREAT|0666) = 1703938
[pid 16328] 05:53:22.789458 socket(AF_UNIX, SOCK_DGRAM, 0) = 26
[pid 16328] 05:53:22.789529 setsockopt(26, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
[pid 16328] 05:53:22.789599 msgrcv(1703938, 0x7f2ada4f4fc0, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 16328] 05:53:22.789667 nanosleep({tv_sec=0, tv_nsec=100000000}, NULL) = 0
[pid 16328] 05:53:22.889931 msgrcv(1703938, 0x7f2ada4f4fc0, 256, 0, IPC_NOWAIT) = -1 ENOMSG (No message of desired type)
[pid 16328] 05:53:22.890079 nanosleep({tv_sec=0, tv_nsec=100000000},  <unfinished ...>
[pid 16301] 05:53:22.931987 <... ppoll resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 16301] 05:53:22.932061 --- SIGURG {si_signo=SIGURG, si_code=SI_USER, si_pid=18720, si_uid=1003} ---
[pid 16301] 05:53:22.932157 ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 16301] 05:53:22.982165 --- SIGURG {si_signo=SIGURG, si_code=SI_USER, si_pid=18720, si_uid=1003} ---
[pid 16301] 05:53:22.982324 ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8 <unfinished ...>
[pid 16328] 05:53:22.990246 <... nanosleep resumed> NULL) = 0
[pid 16328] 05:53:22.990307 msgrcv(1703938, {1, "\222\246attach\315I<snip>

As I mentioned this is a naive fix to get some debugging working, but I thought it was worth sharing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant