Windows event notification #52
Something to think about in this process: according to the docs on Windows, with overlapped I/O you can still get WSAEWOULDBLOCK ("too many outstanding overlapped requests"). No one on the internet seems to have any idea when this actually occurs or why. Twisted has a FIXME b/c they don't handle it, just propagate the error out. Maybe we can do better? Or maybe not. (I bet if you don't throttle UDP sends at all then that's one way to get this? Though libuv AFAICT doesn't throttle UDP sends at all and I guess they do OK... it'd be worth trying this just to see how Windows handles it.)
Small note to remember: if we start using
Hmm, someone on stackoverflow says that they wrote a test program to see if zero-byte IOCP sends could be used to check for writability, and that it didn't work for them :-/
That was me, and AFAICT the IOCP 0-byte trick only works for readability. According to https://blog.grijjy.com/2018/08/29/creating-high-performance-udp-servers-on-windows-and-linux/ the 0-byte read trick does work with MSG_PEEK (even though the MSDN docs say it only works for non-overlapped sockets?).
@tmm1 Hello! 👋 Thank you for that :-). Looking at this again 18 months later, and with that information in mind... there's still no urgent need to change what we're doing, though if we move forward with some of the wilder ideas in #399 then that might make IOCP more attractive (in particular see #399 (comment)). How bad would it be to make that work? I guess we could dig into the "undocumented sorcery" mentioned above and see how bad it is...
I got curious about this and did do a bit more digging.

So when you call `select` on Windows, the call gets routed through Winsock's "service provider" layer, and for ordinary sockets the provider ultimately implements it by submitting a poll request to the kernel's AFD driver. Normally, running on a standard out-of-the-box install of Windows, that's the path every `select` takes.

That part is actually fine. It's arcane and low-level, but documented well-enough in the source code of projects like libuv and wepoll, and basically sensible. You put together an array of structs listing the sockets you want to wait for and what events you want to wait for, and then submit it to the kernel, and get back a regular IOCP notification when it's ready.

However! This is not the only thing that can happen: Winsock also has a "Layered Service Provider" (LSP) mechanism that lets third-party code hook the socket API. Like most unstructured hooking APIs, the designers had high hopes that this would be used for all kinds of fancy things, but in practice it suffers from the deadly flaw of lack of composability.
And in fact, if you hand `select` a set of sockets that are bound to different providers, it doesn't even try to do something sensible. Apparently it picks one of the hooks at random and forces it to handle all of the sockets. Which can't possibly work, but nonetheless.
Also, the LSP design means that you can inject arbitrary code into any process by modifying the registry keys that configure the LSPs. This tends to break sandboxes, and buggy LSPs can crash critical system processes.

So this whole design is fundamentally broken, and MS is actively trying to get rid of it: as of Windows 8 / Windows Server 2012 it's officially deprecated, and to help nudge people away from it, Windows 8 "Metro" apps simply ignore any installed LSPs. Nonetheless, it is apparently used by old firewalls and anti-virus programs and also malware (I guess they must just blindly hook everything, to work around the composability problems?), so some percentage of Windows machines in the wild do still have LSPs installed.

Another fun aspect of LSPs: there are some functions that are not "officially" part of Winsock, but are rather "Microsoft extensions" – things like `AcceptEx` and `ConnectEx` – and the official way to get at them is to ask the socket's own provider for a function pointer, so that LSPs get a chance to intercept them. Twisted and asyncio both get this wrong, btw – they don't go through the socket's own provider when they call them.

You can detect whether an LSP has hooked any particular socket by using `SIO_BASE_HANDLE` and checking whether the handle it returns is the one you already have. You can also blatantly violate the whole LSP abstraction layer by passing that base handle directly to the lower-level APIs, ignoring whatever the LSP wanted to do.

Given that libuv seems to get away with this in practice, and Windows 8 Metro apps do something like this, and that Twisted and asyncio are buggy in the presence of LSPs and seem to get away with it in practice, I think we can probably use the base-handle approach too. But... NPfMW says:
On a Windows system, you can get a list of "providers" (both LSPs and base providers) by running `netsh winsock show catalog`:
(This only has the service providers; the command also lists "name space providers", which is a different extension mechanism, but I left those out.) So if we group the entries with the same GUID, there's one base provider that does IPv4 (TCP, UDP, and RAW), another that does IPv6 (TCP, UDP, and RAW), one that handles Hyper-V sockets, one that handles IrDA sockets, and one that handles "RSVP" (whatever that is). The three special GUIDs that libuv thinks can work with the AFD magic seem to correspond to the IPv4 and IPv6 providers, and the third one, from some googling, appears to be MS's standard Bluetooth provider. (I guess this VM doesn't have the Bluetooth drivers installed.)

So: I think libuv on Windows always ends up using the AFD magic when it's working with IPv4, IPv6, or Bluetooth sockets. When it gets something else – meaning IrDA, Hyper-V, RSVP, or something even more exotic involving a third-party driver – it falls back on calling `select()` in a thread. But AFAIK, on Windows, Python's `socket` module doesn't really support those more exotic socket families anyway.

Final conclusion

I think we can use the "undocumented sorcery" approach after all.
I randomly stumbled upon this message. Since this stuff is so arcane that it's actually kinda fun, some extra pointers:
cough https://github.com/metoo10987/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/afd/poll.c#L68-L707
This is actually kind of a bug in libuv -- this logic makes libuv use its "slow" mode too eagerly (libuv has a fallback mechanism where it simply calls select() in a thread). If a socket is bound to an IFS LSP, it'll have a different GUID that's not one of those 3 on the whitelist; however, the socket would still work with IOCTL_AFD_POLL. Since all base service providers are built into Windows and they all use AFD, it's simply enough to look at the protocol chain length (in the socket's `WSAPROTOCOL_INFO`).
Note that Windows itself actually does this most of the time: since Windows Vista, select() cannot be intercepted by LSPs unless the LSP does all of the following:
I have yet to encounter an LSP that actually does this. If you wanted to be super pedantic about it, you could call both SIO_BASE_HANDLE and SIO_BSP_HANDLE_SELECT and check if they are the same.
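For concreteness, here's an untested ctypes sketch of that check: ask Winsock for the socket's base handle via `SIO_BASE_HANDLE` and compare it to the handle we already hold. The ioctl codes are the standard Winsock values; the helper names are just illustrative, not anything from Trio or wepoll.

```python
import ctypes
import ctypes.wintypes as wt

ws2_32 = ctypes.WinDLL("ws2_32")

SIO_BASE_HANDLE = 0x48000022        # _WSAIOR(IOC_WS2, 34)
SIO_BSP_HANDLE_SELECT = 0x4800001C  # _WSAIOR(IOC_WS2, 28)

def _handle_from_ioctl(sock, code):
    # Ask Winsock which handle sits "underneath" this socket for the given ioctl.
    out = wt.HANDLE()
    returned = wt.DWORD()
    ret = ws2_32.WSAIoctl(
        sock.fileno(), wt.DWORD(code),
        None, 0,
        ctypes.byref(out), ctypes.sizeof(out),
        ctypes.byref(returned),
        None, None,
    )
    if ret != 0:
        raise OSError(ws2_32.WSAGetLastError(), "WSAIoctl failed")
    return out.value

def lsp_present(sock):
    # If the "base" handle differs from the handle we hold, some LSP has
    # wrapped this socket somewhere in the provider chain.
    return _handle_from_ioctl(sock, SIO_BASE_HANDLE) != sock.fileno()
```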
There's really no need to do any of this grouping in practice; they all use the same code path in the AFD driver. I removed it from wepoll in piscisaureus/wepoll@e7e8385. Also note that RSVP is deprecated and no longer supported, and that since the latest Windows 10 update there is AF_UNIX, which also works with IOCTL_AFD_POLL just fine.
@piscisaureus Oh hey, thanks for dropping by! I was wondering why wepoll didn't seem to check GUIDs, and was completely baffled by the docs for the `SIO_BSP_HANDLE_*` ioctls.

What do you think of unconditionally calling `SIO_BASE_HANDLE` and just using the base handle everywhere – isn't that more or less what libuv does?

And do you happen to know what happens if you try to use the AFD poll stuff with a handle that isn't an MSAFD base provider handle? Does it give a sensible error, or...?
Libuv doesn't do that. It uses SIO_BASE_HANDLE to determine whether it can use some optimizations, but it doesn't attempt to bypass LSPs.

If you want to bypass LSPs outright, then you should do it when you create the socket. This is done by using `WSASocketW()` and explicitly passing the base provider's `WSAPROTOCOL_INFO`. When you create a non-IFS-LSP-bound socket and then get and use the base handle everywhere, it would likely cause memory leaks and other weirdness (in particular if you also use the base handle when closing the socket with `closesocket()`).
Not too many people will care, probably. The only LSP I've seen people use intentionally is Proxifier. That's an IFS LSP, so there's no need to bypass it; the socket is a valid AFD handle.
You'll get a STATUS_INVALID_HANDLE error.
Oh, you're right. I was looking at the wrong code – it looks like the only place libuv uses the base handle is to pick which code path to use.
I see, excellent points, thank you. And on further thought, it probably wouldn't have worked anyway, since for the sockets we control we can use IOCP directly.

I guess the main reason this seemed attractive is that it would let us have a single code path, without treating LSP sockets differently. I'm kind of terrified of that, because I have no idea how to test the LSP code paths. Do you by chance happen to know any reliable way to set up LSPs for testing? ("OK, so to make sure we're testing our code under realistic conditions, we're going to install malware on all our Windows test systems...")

Though, it's also probably not a huge deal, because it sounds like we can have a single straight-line code path in all cases anyway, where we do the `SIO_BASE_HANDLE` lookup and then issue the AFD poll against whatever handle comes back.

Thanks for all the information, by the way, this is super helpful.
Indeed, CancelIo/CancelIoEx doesn't work with non-IFS sockets. IIRC libuv also works around issues with SetFileCompletionNotificationModes not working in the presence of non-IFS LSPs. Come to think of it, I'm kind of surprised that CreateIoCompletionPort does work; it seems unlikely that LSPs can trap that?
This was also the main reason we added uv_poll -- to integrate with c-ares (DNS library).

Proxifier, as mentioned earlier, is an IFS LSP I have seen in the wild. IIRC older versions of the Windows SDK also contained a reference implementation for different LSP types. You could compile and install it. I've never gotten around to doing that.
Looking again at Network Programming for Microsoft Windows, 2nd ed., I think the way LSPs handle IO completion ports is that when you call `CreateIoCompletionPort`, the association effectively ends up being made against the underlying base provider's handle. That said, I have no idea why this doesn't work with `CancelIoEx`, then.
That is definitely what we're going to do in Trio, at least to start :-).

Oh beautiful, we are totally going to steal that wholesale. Thank you!

New plan

So it sounds like at this point we now know everything we need to reimplement `wait_socket_{readable,writable}` on top of the AFD polling trick.
this little article has some useful information on the poorly-documented
Correct. The file name doesn't matter and can be the same every time; I'm talking about different handles.
I don't think that's gonna be a problem. But who knows 🤷‍♂️
There is one more thing that I forgot to mention: in wepoll I never associate more than 32 sockets with one AFD device handle.

The reason for this is that `CancelIoEx()` apparently gets slower and slower as more poll operations are pending on the same handle, so cancelling one poll among thousands is expensive. With 32 sockets per AFD handle, the overhead of CancelIoEx is unmeasurable. I felt that the overhead of opening 1.03 handles per socket (on average) was worth it to avoid this problem.
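To spell out the bucketing strategy, here's a pure-Python sketch (illustrative structure and names, not wepoll's actual code): hand out AFD helper handles so that no single handle ever has more than 32 sockets' polls pending on it.

```python
MAX_PER_GROUP = 32

class AfdGroupAllocator:
    def __init__(self, open_afd_handle):
        # open_afd_handle() is assumed to open a fresh \Device\Afd helper handle.
        self._open_afd_handle = open_afd_handle
        self._groups = []  # list of [handle, refcount] pairs

    def acquire(self):
        # Reuse an existing helper handle if it has room, else open a new one.
        for group in self._groups:
            if group[1] < MAX_PER_GROUP:
                group[1] += 1
                return group[0]
        handle = self._open_afd_handle()
        self._groups.append([handle, 1])
        return handle

    def release(self, handle):
        # Drop one reference; a real implementation might close empty handles.
        for group in self._groups:
            if group[0] == handle:
                group[1] -= 1
                return
```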
@piscisaureus whoa, yikes! Trio doesn't expose a register/unregister API; it exposes a one-off wait-for-this-socket API, so we end up registering/unregistering sockets constantly – that's very important to know about!
OK, can confirm; with a little script that creates N sockets, registers READ polls on all of them, and then cancels the polls, on a virtualized Windows 10 v1607 (OS build 14393.2214), I get:
This is using the current version of #1269, which uses a single handle for all AFD_POLL operations. (Note the socket creation times are a bit slow b/c they include setting up loopback TCP connections.) So there's a small super-linearity in the issue times, and a huge super-linearity in the cancel times, though it doesn't get severe until >1000 sockets. For comparison, here's the same script on Linux (using epoll):
Test script:

```python
import time
import trio
import trio.testing
import socket

async def main():
    for total in [10, 100, 1_000, 10_000, 20_000, 30_000]:
        def pt(start, end, desc):
            total_ms = (end - start) * 1000
            per_sock_us = total_ms * 1000 / total
            print(f"{desc}: {total_ms:.2f} ms, {per_sock_us:.2f} µs/socket")

        print(f"\n-- {total} sockets --")
        t_start = time.perf_counter()
        sockets = []
        for _ in range(total // 2):
            a, b = socket.socketpair()
            sockets += [a, b]
        t_sockets = time.perf_counter()
        pt(t_start, t_sockets, "socket creation")
        async with trio.open_nursery() as nursery:
            for s in sockets:
                nursery.start_soon(trio.hazmat.wait_readable, s)
            await trio.testing.wait_all_tasks_blocked()
            t_registered = time.perf_counter()
            pt(t_sockets, t_registered, "spawning wait tasks")
            nursery.cancel_scope.cancel()
        t_cancelled = time.perf_counter()
        pt(t_registered, t_cancelled, "cancelling wait tasks")
        for sock in sockets:
            sock.close()
        pt(t_cancelled, time.perf_counter(), "closing sockets")

trio.run(main)
```
Hmm, and here's something embarrassing... if I run that script with 500 sockets, then on current Trio master I get:
While with #1269's IOCP-based polling, I get:
So switching to IOCP makes this like... ~2x slower. I'm guessing this is mostly the overhead of having moved more code into Python and using FFI to generate C calls at runtime, versus the `select` module's optimized C implementation. Of course the reason I did 500 sockets was because that's about as many as the old `select`-based code can handle. So I think the conclusions for now are:
OK, confirmed this empirically – on the old `select`-based backend, the per-socket times get worse as the socket count grows:
And on the IOCP backend there's no slowdown, like we'd expect:
@piscisaureus btw, based on these measurements the cost of each individual `CancelIoEx` call seems to grow with the number of polls pending on the same handle.
So maybe this is an optimization to consider: instead of calling `IOCTL_AFD_POLL` (and later `CancelIoEx`) the moment a wait is requested or cancelled, defer the submission until the end of the event-loop tick, so a wait that gets registered and cancelled within the same tick never touches the kernel at all.
It could also be overhead associated with making more syscalls. Personally I wouldn't be too worried about it. Polling 500 sockets where all sockets are active is right in the sweet spot for `select()`.
I had no idea it was that bad – I just noticed that cancellation got noticeably slow once many polls were pending on a single handle.
I don't know what they're doing. (The win2k source code for AFD is out there, but I haven't dug into what the cancellation path actually does.)
Hmm. I'm having trouble visualizing how this would be a noticeable win. For something like kqueue or io_uring, where you can submit multiple events in a single syscall, obviously batching makes sense. (We should look into doing that for kqueue, in fact...) For epoll or AFD, the only win would be if you would have cancelled the submission on the same iteration of the event loop where you submitted it, like you say. I think the only two cases where this would happen are:
Am I missing something? It's not obvious to me why this would be a noticeable win for libuv either – your API encourages a different pattern for registering/unregistering waiters, but it still seems like it would be pretty rare for the same socket to get registered and then unregistered again within a single tick. Yet you say that libuv took the trouble to implement this. So that seems like more evidence that I might be missing something.
I am not sure myself that it would actually be a win for Trio. I suggested it because... (emphasis mine)
... gave me the impression that cancellation is a thing that happens often. On Linux it's easy to see how this might happen. Let's consider the case where a user wants to send a large buffer over a socket:
So in every 'write cycle' the user ends up registering interest in writability and then cancelling that registration again shortly after. Again, I'm not actually campaigning for this change, and I don't know for sure whether it is applicable to Trio.
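To make the idea concrete, here's a pure-Python sketch of that deferral scheme (illustrative names, not libuv's or Trio's actual code): poll requests are queued and only submitted once per tick, so a register followed by a cancel within the same tick never reaches the kernel.

```python
class DeferredPoller:
    def __init__(self, submit_poll, cancel_poll):
        self._submit_poll = submit_poll  # assumed to issue the real poll syscall
        self._cancel_poll = cancel_poll  # assumed to issue the real cancellation
        self._pending = {}               # socket -> events, not yet submitted
        self._submitted = set()

    def register(self, sock, events):
        self._pending[sock] = events

    def unregister(self, sock):
        if sock in self._pending:
            del self._pending[sock]      # never hit the kernel: free to drop
        elif sock in self._submitted:
            self._submitted.discard(sock)
            self._cancel_poll(sock)      # already in flight: must really cancel

    def flush(self):
        # Called once per event-loop tick, just before blocking.
        for sock, events in self._pending.items():
            self._submit_poll(sock, events)
            self._submitted.add(sock)
        self._pending.clear()
```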
That could be a performance killer if that is still the current behavior.
#1269 landed a while ago, so I'm going to close this. I doubt we'll have any trouble finding it if we need to reference something. :-)
Apparently Windows is adding Yet Another I/O Management System. Basically io_uring-but-for-Windows, which is neat. Unfortunately, though, it goes through WaitForMultipleEvents, not IOCP: https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/
I have not read all of this page, just the top couple items, but I have a serious question: how come Trio-Python will not accept connections on any IP except 127.0.0.1? I am using
Please create a new issue instead of necroposting; this issue is a discussion about how Trio should handle system events on Windows.
Problem
Windows has 3 incompatible families of event notification APIs: IOCP, `select`/`WSAPoll`, and `WaitForMultipleEvents`-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all 3 of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.

(Actually, all 3 together still isn't sufficient, b/c there are some things that still require threads – like console IO – and I'm ignoring GUI events entirely because Trio isn't a GUI library. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, their statement isn't wrong but does require taking a certain narrow perspective...)
Considerations
The WaitFor*Event family
The `Event`-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise though they're very limited – in particular they have both O(n) behavior and a maximum of 64 objects in an interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use `WaitForSingleObject(Ex)` and allocate one thread per event (sketched below), or else use `WaitForMultipleObjects(Ex)` and try to coalesce up to 64 events into each thread (substantially more complicated to implement, but with 64x less memory overhead for thread stacks, if it matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
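For illustration, here's a minimal, untested sketch of the one-thread-per-event architecture mentioned above: park a thread in `WaitForSingleObject` and run a callback when the handle is signaled. `WaitForSingleObject` is the real kernel32 API; the wrapper and its names are just illustrative, not Trio's actual design.

```python
import ctypes
import ctypes.wintypes as wt
import threading

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
INFINITE = 0xFFFFFFFF
WAIT_OBJECT_0 = 0

def wait_for_handle_in_thread(handle, callback):
    # One dedicated thread per waited-on handle: simple, but costs a thread
    # stack per event, which is the trade-off discussed above.
    def waiter():
        result = kernel32.WaitForSingleObject(wt.HANDLE(handle), wt.DWORD(INFINITE))
        if result == WAIT_OBJECT_0:
            callback()
    t = threading.Thread(target=waiter, daemon=True)
    t.start()
    return t
```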
IOCP

IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model where you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)
Issues:
When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and the kernel worries about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)
When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)
Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.
Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.
Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch.
On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)
select/WSAPoll
You can also use `select`/`WSAPoll`. This is the only documented way to check if a socket is readable/writable. However:

- As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear how much it sucks exactly -- just copying the buffer into kernel-space probably isn't a big deal for realistic interest set sizes -- but clearly it's not as nice as O(1). On my laptop, `select.select` on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference...
- `select.select` is limited to 512 sockets, but this is trivially overcome; the Windows `fd_set` structure is just an array of SOCKETs + a length field, which you can allocate in any size you like (Windows: wait_{read,writ}able limited to 512 sockets #3) – see the ctypes sketch after this list. (This is a nice side-effect of Windows never having had a dense fd space. This also means `WSAPoll` doesn't have much reason to exist. Unlike other platforms where `poll` beats `select` because `poll` uses an array and `select` uses a bitmap, `WSAPoll` is not really any more efficient than `select`. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via `QueueUserAPC`.)
- Non-blocking IO on Windows is apparently a bit inefficient because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on send I guess it makes sense that you can do that legitimately zero-copy with IOCP but not with nonblocking, which is nice.) Again I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.
- `select` only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
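As referenced in the list above, here's an untested ctypes sketch of the oversized-`fd_set` trick: the Windows `fd_set` really is just a count plus a flat array of SOCKET values, so we can declare a bigger one ourselves and call `ws2_32.select()` directly (the size and helper names here are illustrative).

```python
import ctypes

FD_SETSIZE = 4096  # our own limit, not Windows'; pick whatever is needed

class FDSET(ctypes.Structure):
    _fields_ = [
        ("fd_count", ctypes.c_uint),
        ("fd_array", ctypes.c_void_p * FD_SETSIZE),  # SOCKET is pointer-sized
    ]

class TIMEVAL(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]

def big_select(read_socks, timeout_sec):
    ws2_32 = ctypes.WinDLL("ws2_32")
    readfds = FDSET()
    readfds.fd_count = len(read_socks)
    for i, s in enumerate(read_socks):
        readfds.fd_array[i] = s.fileno()
    tv = TIMEVAL(int(timeout_sec), int((timeout_sec % 1) * 1_000_000))
    # The first argument (nfds) is ignored on Windows.
    n = ws2_32.select(0, ctypes.byref(readfds), None, None, ctypes.byref(tv))
    if n < 0:
        raise OSError(ws2_32.WSAGetLastError(), "select failed")
    # On return, fd_count/fd_array describe the sockets that are ready.
    return [readfds.fd_array[i] for i in range(readfds.fd_count)]
```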
Options

Given all of the above, our current design is a hybrid that uses `select` and non-blocking IO for sockets, with IOCP available when needed. We run `select` in the main thread, and IOCP in a worker thread, with a wakeup socket to notify when IOCP events occur (a sketch of this arrangement is below). This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while if you want to modify `select`'s interest set from another thread it's a mess. As an initial design, this makes a lot of sense, because it allows us to provide full features (including e.g. `wait_writable` for adaptive protocols), avoid the tricky issues that IOCP creates for sockets, and requires a minimum of special code.

The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and get rid of the O(n) `select`. However, as we can see above, there's a whole list of challenges that would need to be overcome first.
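As a rough illustration of that hybrid, here's an untested sketch of the worker-thread side: it blocks in `GetQueuedCompletionStatus` (a real kernel32 call) and pokes a wakeup socket so the `select`-based main loop notices. The queue and names are illustrative, not Trio's real internals.

```python
import ctypes
import ctypes.wintypes as wt
import threading

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
INFINITE = 0xFFFFFFFF

def iocp_worker(iocp_handle, wakeup_sock, completions):
    # Runs in a background thread; the main thread stays blocked in select().
    while True:
        nbytes = wt.DWORD()
        key = ctypes.c_void_p()
        overlapped = ctypes.c_void_p()
        ok = kernel32.GetQueuedCompletionStatus(
            wt.HANDLE(iocp_handle),
            ctypes.byref(nbytes),
            ctypes.byref(key),
            ctypes.byref(overlapped),
            wt.DWORD(INFINITE),
        )
        completions.append((bool(ok), nbytes.value, key.value, overlapped.value))
        wakeup_sock.send(b"\x00")  # wake up the select() call in the main loop

def start_iocp_thread(iocp_handle, wakeup_sock, completions):
    t = threading.Thread(
        target=iocp_worker, args=(iocp_handle, wakeup_sock, completions), daemon=True
    )
    t.start()
    return t
```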
Working around IOCP's limitations

UDP sends
I'm not really sure what the best approach here is. One option is just to limit the amount of outstanding UDP data to some fixed amount (maybe tunable through a "virtual" (i.e. implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle; it's just a bit annoying to figure out the details.
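A pure-Python sketch of that bounded-in-flight idea (the names and the limit are hypothetical, not a real Trio API): cap the number of overlapped UDP sends outstanding at once, and drop or error beyond that.

```python
class BoundedUdpSender:
    # Cap concurrent overlapped UDP sends, since IOCP gives us no kernel-side
    # send buffering to lean on.

    def __init__(self, issue_overlapped_send, max_in_flight=64):
        # issue_overlapped_send(data, addr, on_complete) is assumed to start an
        # overlapped send and invoke on_complete() when the kernel finishes it.
        self._issue = issue_overlapped_send
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self.dropped = 0

    def send(self, data, addr):
        if self._in_flight >= self._max_in_flight:
            self.dropped += 1  # or raise an error, or buffer -- a policy decision
            return False
        self._in_flight += 1
        self._issue(data, addr, self._on_complete)
        return True

    def _on_complete(self):
        self._in_flight -= 1
```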
Spurious extra latency in TCP receives
I think that using the `MSG_PUSH_IMMEDIATE` flag should solve this.

Checking readability / writability
It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read". If you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs! also note this is ch. 6 of "NPfMW", referenced below), ref2, ref3.
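Here's an untested ctypes sketch of what issuing such a zero-byte read might look like; it assumes the socket has already been associated with an IOCP so the completion can actually be collected, and the struct definitions are simplified.

```python
import ctypes
import ctypes.wintypes as wt

ws2_32 = ctypes.WinDLL("ws2_32")
WSA_IO_PENDING = 997

class WSABUF(ctypes.Structure):
    _fields_ = [("len", wt.ULONG), ("buf", ctypes.c_char_p)]

class OVERLAPPED(ctypes.Structure):
    _fields_ = [
        ("Internal", ctypes.c_void_p),
        ("InternalHigh", ctypes.c_void_p),
        ("Offset", wt.DWORD),
        ("OffsetHigh", wt.DWORD),
        ("hEvent", wt.HANDLE),
    ]

def issue_zero_byte_read(sock):
    # Post a 0-byte WSARecv; per the folklore above, it should only complete
    # once there is data ready to read.
    buf = WSABUF(0, None)
    flags = wt.DWORD(0)
    received = wt.DWORD(0)
    overlapped = OVERLAPPED()
    ret = ws2_32.WSARecv(
        sock.fileno(), ctypes.byref(buf), 1,
        ctypes.byref(received), ctypes.byref(flags),
        ctypes.byref(overlapped), None,
    )
    if ret != 0 and ws2_32.WSAGetLastError() != WSA_IO_PENDING:
        raise OSError("WSARecv failed")
    return overlapped  # must stay alive until the completion is delivered
```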
That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with `MSG_PEEK` set (to avoid consuming the packet, truncating it to zero bytes in the process). MSDN explicitly says that this doesn't work (`MSG_PEEK` and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.

What about writability? Empirically, if you have a non-blocking socket on Windows with a full send buffer and you do a zero-byte send, it returns `EWOULDBLOCK`. (This is weird; other platforms don't do this.) If this behavior also translates to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.

For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering then presumably we can implement it there.
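That claim is easy to test empirically with plain non-blocking sockets; a rough, untested sketch (using `socket.socketpair`, which is emulated over loopback TCP on Windows):

```python
import socket

def zero_byte_send_blocks_when_full():
    a, b = socket.socketpair()
    a.setblocking(False)
    try:
        try:
            while True:
                a.send(b"x" * 65536)  # stuff data until the send buffer fills
        except BlockingIOError:
            pass
        try:
            a.send(b"")               # the actual experiment: a zero-byte send
            return False              # succeeded even though the buffer is full
        except BlockingIOError:
            return True               # matches the EWOULDBLOCK behavior above
    finally:
        a.close()
        b.close()
```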
Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how `select` is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]

Implementing all this junk
I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.
TODO
If we do want to switch to using IOCP in general, then the sequence would go something like:

- check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going to IOCP-only is even possible (might be worth checking what doing UDP sends with `MSG_PARTIAL` does too while we're at it)
- check whether you really can do zero-byte reads on UDP sockets like libuv claims
- figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking then I guess the non-blocking APIs remain viable even if we can't do `wait_socket_writable` on UDP sockets)

At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:

- migrate away from `select` for the cases that can't use IOCP readable/writable checking [Not necessary, AFD-based `select` should work for these too]:
  - connect
  - accept
- implement `wait_socket_readable` and `wait_socket_writable` on top of IOCP and get rid of `select` (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a `select` replacement)
- (optional / someday) switch to using IOCP for everything instead of non-blocking I/O

New plan:

- implement `wait_socket_{readable,writable}` using AFD, and confirm it works
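For reference while implementing this, here's the AFD poll interface as wepoll/libuv define it, transcribed into ctypes; the exact constants and layout are taken from my reading of wepoll's source, so treat them as assumptions to verify rather than documented facts.

```python
import ctypes
import ctypes.wintypes as wt

IOCTL_AFD_POLL = 0x00012024

# Event masks (subset, as defined in wepoll)
AFD_POLL_RECEIVE      = 0x0001
AFD_POLL_SEND         = 0x0004
AFD_POLL_DISCONNECT   = 0x0008
AFD_POLL_ABORT        = 0x0010
AFD_POLL_LOCAL_CLOSE  = 0x0020
AFD_POLL_ACCEPT       = 0x0080
AFD_POLL_CONNECT_FAIL = 0x0100

class AFD_POLL_HANDLE_INFO(ctypes.Structure):
    _fields_ = [
        ("Handle", wt.HANDLE),
        ("Events", wt.ULONG),
        ("Status", ctypes.c_int32),   # NTSTATUS
    ]

class AFD_POLL_INFO(ctypes.Structure):
    _fields_ = [
        ("Timeout", ctypes.c_int64),  # LARGE_INTEGER
        ("NumberOfHandles", wt.ULONG),
        ("Exclusive", wt.ULONG),
        ("Handles", AFD_POLL_HANDLE_INFO * 1),
    ]
```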