Performance tuning a simple shell pipeline (are we missing SIGPIPE?) #373

Closed
rw opened this issue Mar 13, 2017 · 10 comments

rw commented Mar 13, 2017

I'm trying to convert a bash script to use sh, version sh==1.12.10. Here's an example command that I'm trying:

sh.head(sh.cat('/usr/share/dict/words', _piped='direct'))

The above takes about 2.81 seconds on my system (timed with %timeit in an IPython shell).

Running the same pipeline in bash with time (cat /usr/share/dict/words | head) is of course much quicker: about 4 milliseconds.

Am I using sh correctly? The docs don't seem to cover troubleshooting pipeline performance issues.

rw commented Mar 13, 2017

My best workaround is to do:

sh.sh('-c', 'cat /usr/share/dict/words | head')

But according to %timeit that's still much slower than bash, at about 15 ms per iteration.

amoffat commented Mar 13, 2017

So _piped='direct' isn't a thing anymore. Direct piping is now the default, so you just want one of the values the docs say are allowed: True or "out". In fact, passing "direct" causes piping to not work at all (possibly a side bug), since it's not a valid value, which accounts for your slowdown.
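
Concretely, the corrected pipeline would look something like:

sh.head(sh.cat('/usr/share/dict/words', _piped=True))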

However, when you use _piped=True, cat immediately fails with a broken pipe, because sh.head hasn't started yet. cat seems to be the only program that has this issue, since this works fine:

sh.head(sh.head('/usr/share/dict/words', _piped=True, n=10000))

Not sure if this is a bug with sh.

amoffat commented Mar 13, 2017

Also, as you can probably see from the sh.head(sh.head(file, ...)) example, you can just use sh.head once on your file :)
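
In other words, something like this by itself should do it:

sh.head('/usr/share/dict/words', n=10)  # equivalent to: head -n 10 /usr/share/dict/words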

rw commented Mar 13, 2017

Thanks for the quick response. Are you using master? I ran your example and got a SIGPIPE exception:

head: error writing 'standard output': Broken pipe

amoffat commented Mar 13, 2017

I'm on 1.12.10. Running it a few times now, it fails for me sometimes too. If I change it to n=100000 it fails consistently. This looks like a timing issue in sh; I'll need to dig into it further.
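
That is, something along the lines of:

sh.head(sh.head('/usr/share/dict/words', _piped=True, n=100000))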

amoffat commented Mar 14, 2017

Ok, this issue should be fixed in 1.12.11. Basically the problem was related to SIGPIPE, as you originally suggested. Python sets SIGPIPE to SIG_IGN on startup, so spawned processes were ignoring SIGPIPE. But they were still dying (with exit code 1) when a write reported EPIPE.

However, a race existed because sometimes the piping source process finished before the piping destination process, so there was never a "hang up" on the fd (and therefore never an EPIPE)...the data just stayed in the pipe buffer until the destination process could read it. I imagine that's why the tests never caught it. Your sample code caught it because cat was longer-lived than head, and head didn't consume all the data.

The fix was to make sure spawned processes saw SIGPIPE, and then suppress any exception generated by _piped processes that received a SIGPIPE.
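
For reference, here is a rough sketch of the general technique (illustrative only, written with plain subprocess rather than sh's internals, and the helper name is made up):

import signal
import subprocess

def restore_sigpipe():
    # runs in the child between fork() and exec(): undo Python's SIG_IGN so the
    # child dies from SIGPIPE rather than erroring out on EPIPE writes
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

source = subprocess.Popen(['cat', '/usr/share/dict/words'],
                          stdout=subprocess.PIPE, preexec_fn=restore_sigpipe)
dest = subprocess.Popen(['head'], stdin=source.stdout,
                        preexec_fn=restore_sigpipe)
source.stdout.close()  # so cat sees the pipe close once head exits
dest.wait()
source.wait()
# cat typically exits here from SIGPIPE (returncode == -signal.SIGPIPE), and sh
# now treats that exit as OK for a _piped source instead of raising an exception.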

Anyways, good find. Confirm that it works for you and let me know.

rw commented Mar 14, 2017

Works great, thanks for fixing this so quickly. I'm glad my hunch about SIGPIPE was useful :-)

rw commented Mar 14, 2017

Note that sh takes about double the time of invoking the same command with just bash. In the following example, I try to minimize overhead by resolving the commands only once:

import sh
head = sh.head
cat = sh.cat
%timeit -n 100 -r 5 head(cat('/usr/share/dict/words', _piped=True))
100 loops, best of 5: 30.8 ms per loop

And, here, I use the aforementioned kludge of invoking the entire command in a single sh call:

import sh
shell = sh.sh
%timeit -n 100 -r 5 shell('-c', 'cat /usr/share/dict/words | head')
100 loops, best of 5: 17.2 ms per loop

Finally, here it is timed directly from the shell:

$ time sh -c 'cat /usr/share/dict/words | head'
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron

real    0m0.011s
user    0m0.004s
sys     0m0.007s

amoffat commented Mar 14, 2017

Does that performance gap shrink the longer the processes run? Try piping to head -n 50000
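
For example, reusing the head/cat setup from the timing run above, something like:

%timeit -n 100 -r 5 head(cat('/usr/share/dict/words', _piped=True), n=50000)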

rw commented Mar 14, 2017

The difference doesn't shrink on my system as n grows.

More importantly, thanks for fixing this, it's much better!

0-wiz-0 added a commit to NetBSD/pkgsrc-wip that referenced this issue Apr 19, 2017
*   pypi readme doc bugfix [PR#377](amoffat/sh#377)

*   bugfix for relative paths to `sh.Command` not expanding to absolute paths [#372](amoffat/sh#372)
*   updated for python 3.6
*   bugfix for SIGPIPE not being handled correctly on pipelined processes [#373](amoffat/sh#373)