Stdout and stderr handling can be extremely slow on Windows #117
bturner added a commit to bturner/NuProcess that referenced this issue on Sep 25, 2020:
Using a large buffer size for the pipe connecting the child process can result in IOCP imposing (very short) delays before notifying the port when a read completes. For a small number of reads the short delays are no problem, but for processes that produce hundreds of megabytes of output all those short delays accumulate into a massive performance hit, reducing throughput compared to ProcessBuilder by an order of magnitude. Using a smaller pipe buffer does impose some constraint on how much data can be moved by a single ReadFile or WriteFile call, but it also prevents delays (because there's no point in waiting for more input if the buffer is already full), resulting in overall improved performance.

- Reduced WindowsProcess.BUFFER_SIZE from 64K to 4096 + 24, which matches the buffer size the JDK's ProcessImpl uses on Windows
- Replaced System.err with a Logger when writing errors
  - This matches how LinuxProcess and OsxProcess are written
- Optimized HANDLE.fromNative to check the pointer directly, to avoid creating HANDLE instances for invalid handles
  - This also avoids the base implementation's use of reflection to instantiate new HANDLE instances
- Updated PipeBundle to set auto-sync to false on OVERLAPPED instances so JNA won't waste time marshaling to/from native code around every call to ReadFile or WriteFile
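The accumulation effect the commit message describes can be sketched with a back-of-envelope model (my own illustration, not code from the commit; the 1ms per-completion delay is an assumed figure, since the actual IOCP batching delay is undocumented): total stall grows linearly with the number of read completions, so many small reads magnify even a tiny per-read delay.

```java
// Back-of-envelope model of per-completion delay accumulation.
// All figures except the 480MB clone size (from the report below) are assumptions.
public class CompletionDelayModel {
    // number of read completions needed to move totalBytes in chunkSize pieces
    static long completions(long totalBytes, int chunkSize) {
        return (totalBytes + chunkSize - 1) / chunkSize;
    }

    // accumulated stall if every completion carries an extra delayMicros delay
    static double extraSeconds(long totalBytes, int chunkSize, double delayMicros) {
        return completions(totalBytes, chunkSize) * delayMicros / 1_000_000.0;
    }

    public static void main(String[] args) {
        long total = 480L * 1024 * 1024; // ~480MB clone from the report
        // hypothetical 1ms per completion: ~123s of pure delay across ~123K reads
        System.out.printf("completions: %d%n", completions(total, 4096));
        System.out.printf("extra stall: %.1f s%n", extraSeconds(total, 4096, 1000));
    }
}
```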
When used to pump processes that produce lots of output in small chunks (which happens a lot with `git` commands, especially `git http-backend` and `git upload-pack`, which are used to serve clones and fetches), the existing I/O completion handler code in `ProcessCompletions` runs afoul of how IOCP decides when to signal ready reads. When some data is available to read, IOCP waits (a very short wait) to see if more data arrives before triggering. For short-lived processes, or processes that produce their output in big chunks, that works fine. But if the process produces output in many, many tiny pieces, that extra delay amounts to a huge performance hit.

This came up for Bitbucket Server in BSERV-12599, where it was reported that, since we switched over to NuProcess to run `git http-backend` and `git upload-pack`, hosting operations on Windows are an order of magnitude slower than they were using `ProcessBuilder` and blocking I/O.

To give a sense of scale, cloning a 500MB repository (Bitbucket Server's own source) via Bitbucket Server using `ProcessBuilder` looks like this: 480MB at 34MB/s, with the entire operation taking about 25 seconds (the 34MB/s transfer is only part of the overall time).

Switching over to NuProcess completely tanks performance: we've dropped from 34MB/s to 2MB/s, and the overall operation now takes over 3 minutes. For larger repositories the difference is even more painful, taking clones that previously ran in 30-60 seconds and blowing them out to 10-15 minutes. That results in stacking load on Bitbucket Server that eventually causes rejected requests due to excessive queuing.
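For scale, the raw transfer times implied by those rates can be worked out with simple arithmetic on the figures above (the 25-second and 3-minute end-to-end totals also include work besides the transfer itself):

```java
// Transfer-time arithmetic using the rates and pack size reported above.
public class TransferTime {
    // seconds to move `megabytes` of data at a sustained `mbPerSec`
    static double seconds(double megabytes, double mbPerSec) {
        return megabytes / mbPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("ProcessBuilder (34MB/s): %.1f s%n", seconds(480, 34)); // ~14 s
        System.out.printf("NuProcess      (2MB/s):  %.1f s%n", seconds(480, 2));  // 240 s
    }
}
```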
I stripped out all Bitbucket Server's code and wrote a test in NuProcess that runs `git http-backend` directly, with the right `stdin` and environment to produce the same effective operation. (Unfortunately this test isn't really shareable because it relies on some canned `stdin` I captured, as well as access to a specific Git repository.) With that test, I'm able to reproduce the performance issue without any Bitbucket Server code at all. (It's worth noting that the test executed on Linux or macOS performs fine, with NuProcess speeds essentially identical to `ProcessBuilder`.)

In trying to track down the issue, I looked through the JDK's source and found they use `4096 + 24` byte buffers for their pipes. Changing `WindowsProcess.BUFFER_SIZE` from 64K to `4096 + 24` fixes the issue and produces identical throughput with NuProcess compared to `ProcessBuilder`.

A colleague helping me search for this found some other cases where IOCP's Nagle-like approach has caused problems.