Stdout and stderr handling can be extremely slow on Windows #117
bturner added a commit to bturner/NuProcess that referenced this issue on Sep 25, 2020:
Using a large buffer size for the pipe connecting the child process can result in IOCP imposing (very short) delays before notifying the port when a read completes. For a small number of reads the short delays are no problem, but for processes that produce hundreds of megabytes of output all those short delays accumulate into a massive performance hit, reducing throughput compared to ProcessBuilder by an order of magnitude. Using a smaller pipe buffer does impose some constraint on how much data can be moved by a single ReadFile or WriteFile call, but it also prevents delays (because there's no point in waiting for more input if the buffer is already full), resulting in overall improved performance.

- Reduced WindowsProcess.BUFFER_SIZE from 64K to 4096 + 24, which matches the buffer size the JDK's ProcessImpl uses on Windows
- Replaced System.err with a Logger when writing errors
  - This matches how LinuxProcess and OsxProcess are written
- Optimized HANDLE.fromNative to check the pointer directly, to avoid creating HANDLE instances for invalid handles
  - This also avoids the base implementation's use of reflection to instantiate new HANDLE instances
- Updated PipeBundle to set auto-sync to false on OVERLAPPED instances so JNA won't waste time marshaling to/from native code around every call to ReadFile or WriteFile
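The accumulation effect the commit message describes can be sketched with a back-of-envelope model (my own illustration, not code from the commit; the 1ms per-completion delay is an assumed figure, since the actual IOCP batching delay is undocumented): total stall grows linearly with the number of read completions, so many small reads magnify even a tiny per-read delay.

```java
// Back-of-envelope model of per-completion delay accumulation.
// All figures except the 480MB clone size (from the report below) are assumptions.
public class CompletionDelayModel {
    // number of read completions needed to move totalBytes in chunkSize pieces
    static long completions(long totalBytes, int chunkSize) {
        return (totalBytes + chunkSize - 1) / chunkSize;
    }

    // accumulated stall if every completion carries an extra delayMicros delay
    static double extraSeconds(long totalBytes, int chunkSize, double delayMicros) {
        return completions(totalBytes, chunkSize) * delayMicros / 1_000_000.0;
    }

    public static void main(String[] args) {
        long total = 480L * 1024 * 1024; // ~480MB clone from the report
        // hypothetical 1ms per completion: ~123s of pure delay across ~123K reads
        System.out.printf("completions: %d%n", completions(total, 4096));
        System.out.printf("extra stall: %.1f s%n", extraSeconds(total, 4096, 1000));
    }
}
```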
When used to pump processes that produce lots of output in small chunks (which happens a lot with `git` commands, especially `git http-backend` and `git upload-pack`, which are used to serve clones and fetches), the existing I/O completion handler code in `ProcessCompletions` runs afoul of how IOCP decides when to signal ready reads. When some data is available to read, IOCP waits (a very short wait) to see if more data arrives before triggering. For short-lived processes, or processes that produce their output in big chunks, that works fine. But if the process produces output in many, many tiny pieces, that extra delay amounts to a huge performance hit.

This came up for Bitbucket Server in BSERV-12599, where it was reported that, since we switched over to NuProcess to run `git http-backend` and `git upload-pack`, hosting operations on Windows are an order of magnitude slower than they were using `ProcessBuilder` and blocking I/O.

To give a sense of scale, cloning a 500MB repository (Bitbucket Server's own source) via Bitbucket Server using `ProcessBuilder` looks like this: 480MB at 34MB/s, with the entire operation taking about 25 seconds (the 34MB/s transfer is only part of the overall time).

Switching over to NuProcess completely tanks performance: we've dropped from 34MB/s to 2MB/s, and the overall operation now takes over 3 minutes. For larger repositories the difference is even more painful, taking clones that previously ran in 30-60 seconds and blowing them out to 10-15 minutes. That results in stacking load on Bitbucket Server that eventually causes rejected requests due to excessive queuing.
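For scale, the raw transfer times implied by those rates can be worked out with simple arithmetic on the figures above (the 25-second and 3-minute end-to-end totals also include work besides the transfer itself):

```java
// Transfer-time arithmetic using the rates and pack size reported above.
public class TransferTime {
    // seconds to move `megabytes` of data at a sustained `mbPerSec`
    static double seconds(double megabytes, double mbPerSec) {
        return megabytes / mbPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("ProcessBuilder (34MB/s): %.1f s%n", seconds(480, 34)); // ~14 s
        System.out.printf("NuProcess      (2MB/s):  %.1f s%n", seconds(480, 2));  // 240 s
    }
}
```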
I stripped out all Bitbucket Server's code and wrote a test in NuProcess that runs `git http-backend` directly, with the right `stdin` and environment to produce the same effective operation. (Unfortunately this test isn't really shareable because it relies on some canned `stdin` I captured, as well as access to a specific Git repository.) With that test, I'm able to reproduce the performance issue without any Bitbucket Server code at all. (It's worth noting that the test executed on Linux or macOS performs fine, with NuProcess speeds essentially identical to `ProcessBuilder`.)

In trying to track down the issue, I looked through the JDK's source and found they use `4096 + 24` byte buffers for their pipes. Changing `WindowsProcess.BUFFER_SIZE` from 64K to `4096 + 24` fixes the issue and produces identical throughput with NuProcess compared to `ProcessBuilder`.

A colleague helping me search for this found some other cases where IOCP's Nagle-like approach has caused problems.