
Stdout and stderr handling can be extremely slow on Windows #117

Closed
bturner opened this issue Sep 25, 2020 · 1 comment

bturner commented Sep 25, 2020

When used to pump processes that produce lots of output in small chunks (which happens a lot with git commands--especially git http-backend and git upload-pack, which are used to serve clones and fetches), the existing I/O completion handler code in ProcessCompletions runs afoul of how IOCP decides when to signal ready reads. When some data is available to read, IOCP waits (a very short wait) to see if more data arrives before triggering. For short-lived processes, or processes that produce their output in big chunks, that works fine. But if the process produces output in many, many tiny pieces, that extra delay amounts to a huge performance hit.

This came up for Bitbucket Server in BSERV-12599, where it was reported that, since we switched over to NuProcess to run git http-backend and git upload-pack, hosting operations on Windows are an order of magnitude slower than they were using ProcessBuilder and blocking I/O.

To give a sense of scale, cloning a 500MB repository (Bitbucket Server's own source) via Bitbucket Server using ProcessBuilder looks like this:

C:\Temp>git clone --no-checkout http://localhost:7990/bitbucket/scm/at/bitbucket-server.git
Cloning into 'bitbucket-server'...
remote: Enumerating objects: 2304995, done.
remote: Counting objects: 100% (2304995/2304995), done.
remote: Compressing objects: 100% (779774/779774), done.
remote: Total 2304995 (delta 1058323), reused 2304995 (delta 1058323), pack-reused 0
Receiving objects: 100% (2304995/2304995), 480.51 MiB | 33.96 MiB/s, done.
Resolving deltas: 100% (1058323/1058323), done.

480MB at 34MB/s, with the entire operation taking about 25 seconds (the 34MB/s transfer is only part of the overall time).
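As a sanity check on those numbers (plain arithmetic, nothing NuProcess-specific), the reported rates translate to transfer times like this; the rate and size figures are taken from the clone output above:

```java
public class TransferMath {
    // Time spent purely moving bytes at the reported rate.
    static double transferSeconds(double mib, double mibPerSec) {
        return mib / mibPerSec;
    }

    public static void main(String[] args) {
        // ProcessBuilder run: 480.51 MiB at 33.96 MiB/s -> ~14s of the ~25s total
        System.out.printf("ProcessBuilder transfer: %.1fs%n",
                transferSeconds(480.51, 33.96));
        // NuProcess run: same payload at 2.08 MiB/s -> ~231s, i.e. over 3 minutes
        System.out.printf("NuProcess transfer:      %.1fs%n",
                transferSeconds(480.51, 2.08));
    }
}
```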

Switching over to NuProcess completely tanks performance:

C:\Temp>git clone --no-checkout http://localhost:7990/bitbucket/scm/at/bitbucket-server.git
Cloning into 'bitbucket-server'...
remote: Enumerating objects: 2304995, done.
remote: Counting objects: 100% (2304995/2304995), done.
remote: Compressing objects: 100% (779774/779774), done.
remote: Total 2304995 (delta 1058323), reused 2304995 (delta 1058323), pack-reused 0
Receiving objects: 100% (2304995/2304995), 480.51 MiB | 2.08 MiB/s, done.
Resolving deltas: 100% (1058323/1058323), done.

We've dropped from 34MB/s to 2MB/s, and the overall operation now takes over 3 minutes. For larger repositories the difference is even more painful, taking clones that previously ran in 30-60 seconds and blowing them out to 10-15 minutes. That results in stacking load on Bitbucket Server that eventually causes rejected requests due to excessive queuing.

I stripped out all of Bitbucket Server's code and wrote a test in NuProcess that runs git http-backend directly, with the right stdin and environment to produce the same effective operation. (Unfortunately this test isn't really shareable because it relies on some canned stdin I captured, as well as access to a specific Git repository.) With that test, I'm able to reproduce the performance issue without any Bitbucket Server code at all. (It's worth noting that the same test on Linux or macOS performs fine, with NuProcess speeds essentially identical to ProcessBuilder.)

In trying to track down the issue, I looked through the JDK's source and found it uses (4096 + 24)-byte buffers for its pipes. Changing WindowsProcess.BUFFER_SIZE from 64K to 4096 + 24 fixes the issue and brings NuProcess throughput back in line with ProcessBuilder.
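A back-of-the-envelope comparison of what the smaller buffer costs in completion volume (the 480 MiB payload is approximated; real chunk boundaries depend on how git writes, so these counts are illustrative):

```java
public class PipeBufferMath {
    // How many read completions it takes to move the payload,
    // assuming every read fills the pipe buffer (best case).
    static long readsNeeded(long payloadBytes, int bufferSize) {
        return (payloadBytes + bufferSize - 1) / bufferSize; // ceiling division
    }

    public static void main(String[] args) {
        long payload = 480L * 1024 * 1024; // ~480 MiB clone payload
        System.out.println("64K buffers:     " + readsNeeded(payload, 65536));
        System.out.println("4096+24 buffers: " + readsNeeded(payload, 4096 + 24));
    }
}
```

The smaller buffer means roughly 16x more ReadFile completions, yet because each one signals without the extra wait, the operation as a whole is an order of magnitude faster.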

A colleague helping me research this found other cases where IOCP's Nagle-like batching has caused similar problems.


bturner commented Sep 25, 2020

I believe reducing the pipe buffer size in WindowsProcess fixes the performance problem because once the pipe buffer is full there's no reason to wait for any more input--there's no room for it anyway. With that (small) delay avoided, the overall operation ends up being much, much faster despite needing to move more buffers.
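To illustrate why a tiny per-completion wait matters at this scale: IOCP's actual internal delay isn't documented, so the 1ms figure below is purely an assumption for illustration, but even sub-millisecond waits accumulate badly over ~120k small reads:

```java
public class DelayMath {
    // Aggregate cost of a small fixed delay per read completion.
    // delayMicros is an assumed value; the real IOCP wait is undocumented.
    static double totalDelaySeconds(long completions, double delayMicros) {
        return completions * delayMicros / 1_000_000.0;
    }

    public static void main(String[] args) {
        // ~480 MiB arriving in ~4 KiB chunks -> roughly 120,000 completions.
        // At an assumed 1ms per completion, the delays alone total ~2 minutes.
        System.out.println(totalDelaySeconds(120_000, 1_000) + "s of pure delay");
    }
}
```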

bturner added a commit to bturner/NuProcess that referenced this issue Sep 25, 2020
Using a large buffer size for the pipe connecting the child process can
result in IOCP imposing (very short) delays before notifying the port
when a read completes. For a small number of reads the short delays are
no problem, but for processes that produce hundreds of megabytes of
output all those short delays accumulate into a massive performance hit,
reducing throughput compared to ProcessBuilder by an order of magnitude.

Using a smaller pipe buffer does impose some constraint on how much data
can be moved by a single ReadFile or WriteFile call but it also prevents
delays (because there's no point in waiting for more input if the buffer
is already full), resulting in overall improved performance.

- Reduced WindowsProcess.BUFFER_SIZE from 64K to 4096+24, which matches
  the buffer size the JDK's ProcessImpl uses on Windows
- Replaced System.err with a Logger when writing errors
  - This matches how LinuxProcess and OsxProcess are written
- Optimized HANDLE.fromNative to check the pointer directly, to avoid
  creating HANDLE instances for invalid handles
  - This also avoids the base implementation's use of reflection to
    instantiate new HANDLE instances
- Updated PipeBundle to set auto-sync to false on OVERLAPPED instances
  so JNA won't waste time marshaling to/from native code around every
  call to ReadFile or WriteFile