
Zero copy send #2147

Closed · mrocklin opened this issue Sep 6, 2017 · 18 comments

mrocklin (Contributor) commented Sep 6, 2017

It would be convenient to reduce CPU use when sending large memoryviews of data. I've brought this up a couple of times before, and it has also been raised by @bryevdv for websockets.

Previously I've closed my issues saying that this wasn't yet a bottleneck for my applications (Dask). However, several users are now running Dask on HPC systems with fast (multi-GB/s) interconnects, and it has become a bottleneck, so I'd like to revisit the issue.

Are there any objections or known challenges (other than developer time, of course) to handling memoryviews all the way from user input to the system call? I'm able to spend some cycles on this problem, but I'd like to verify that it's feasible and check in with the core devs in case there is anything I should be aware of before starting.
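
For concreteness, a minimal sketch of the usage pattern in question (the array, its size, and the variable names are illustrative, not an existing API):

import numpy as np

# A large contiguous buffer produced by the application, e.g. a NumPy array.
arr = np.random.randint(0, 255, dtype='u1', size=100000000)
buf = memoryview(arr)   # zero-copy view over the array's memory

# The hope is that a call like stream.write(buf), with the stream obtained
# from TCPClient.connect, could hand buf (or slices of it) straight to the
# socket's send() instead of first copying it into a bytes-based send buffer.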

mrocklin (Contributor, Author) commented Sep 6, 2017

I am also interested in accepting large amounts of data without excessive copying.

mrocklin (Contributor, Author) commented Sep 6, 2017

Relevant application issue here: pangeo-data/pangeo#6

bdarnell (Member) commented Sep 8, 2017

No objection in principle. The main thing to watch out for is that we not disadvantage the common case to make the extreme cases faster. I don't think that's too hard on the sending side: the common case is for each call to IOStream.write to translate to one syscall, so we just need to try that syscall first before we try to put anything into the send buffer. It'll make that code a little bit more complex, but nothing too bad.
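
A rough sketch of that write-side idea, assuming a non-blocking socket (the names and the error handling are illustrative, not Tornado's actual internals):

from collections import deque

# Per-connection queue of memoryview slices that still have to be sent.
pending = deque()


def write(sock, data):
    """Try the send() syscall first; buffer only whatever the kernel refuses."""
    view = memoryview(data)
    if not pending:                 # common case: nothing queued yet
        try:
            sent = sock.send(view)  # one syscall, straight from the caller's buffer
        except BlockingIOError:     # the socket is assumed to be non-blocking
            sent = 0
        view = view[sent:]          # slicing a memoryview does not copy
    if view.nbytes:
        pending.append(view)        # only the unsent tail gets buffered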

The read side is trickier - even though I prefer the ergonomics of the IOStream interface, the Twisted-style Protocol interface does have advantages for read performance. More dramatic changes may be needed here. If the changes get big enough it may be worth thinking about building more directly on the asyncio primitives (You're on python 3, right? In case you haven't been following along, I've begun the long process of phasing out Tornado's IOLoop in favor of the asyncio event loop. In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)).

bryevdv commented Sep 8, 2017

FYI I have just implemented a binary array protocol (for sending arrays only, from server to client) for Bokeh using array.tobytes. Early evaluation points to a significant improvement, but for array sizes of 1000x1000 it's not quite interactive when scrubbing a slider to update. I'd definitely be happy to see whether fewer copies (for WS sends) makes a difference.

mrocklin (Contributor, Author) commented Sep 8, 2017

> No objection in principle. The main thing to watch out for is that we not disadvantage the common case to make the extreme cases faster. I don't think that's too hard on the sending side: the common case is for each call to IOStream.write to translate to one syscall, so we just need to try that syscall first before we try to put anything into the send buffer. It'll make that code a little bit more complex, but nothing too bad.

We'll still need to handle buffering of some sort, I suspect. I doubt that the system call will ever accept all of the data, so we'll need to maintain write buffers, but hopefully those write buffers can contain memoryview objects.
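
A sketch of how such a memoryview-based write buffer might be drained once the socket becomes writable again (illustrative only, not actual IOStream code; pending is assumed to be a deque of memoryview slices as in the sketch above):

def drain(sock, pending):
    """Flush queued memoryview slices; called when the socket is writable again."""
    while pending:
        view = pending[0]
        try:
            sent = sock.send(view)
        except BlockingIOError:
            return                    # kernel buffer is full; wait for the next writable event
        if sent < view.nbytes:
            pending[0] = view[sent:]  # keep only the unsent tail, still without copying
            return
        pending.popleft()             # this chunk is fully written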

I haven't looked into read performance yet. I appreciate the warning that it might be more challenging.

> You're on python 3, right?

Dask supports both, but I don't mind doing Python 3 only for advanced features and extra benefits (>1GB/s bandwidth certainly counts).

> In case you haven't been following along, I've begun the long process of phasing out Tornado's IOLoop in favor of the asyncio event loop. In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3).

This raises two questions:

  1. How long will Tornado's IOStream stay around? It would be unfortunate to invest time optimizing this implementation only to have it be replaced by an AsyncIO implementation.
  2. When is the next expected release? This doesn't need to be very soon for my use cases, but it seems like the next release might be 5.0 and that 5.0 might be a long time (possibly forever) away.

bdarnell (Member) commented Sep 9, 2017

  1. I doubt IOStream will ever go away completely. It'll remain for backwards compatibility if nothing else. However, it is possible that the upper levels of the stack (HTTPServer and friends) migrate some day to something other than IOStream (i.e. asyncio's native interfaces). This is unlikely to happen until after we drop support for python 2.
  2. The next release will be 5.0, and it's sooner than "never" but I can't put a precise time on it because I've had very little time for tornado recently. I think it'll probably be in 2018. Tornado 5.0 will use asyncio by default on python 3, but won't drop support for python 2. Dropping support for python 2 is in the "long time (possibly forever) away" category.

mrocklin (Contributor, Author) commented:

> I'd definitely be happy to see whether fewer copies (for WS sends) makes a difference.

@bryevdv we were chatting briefly about profiling. If you do manage to get a good profiling script it would be very interesting to see how much time is spent where. This would certainly allow us to target performance improvements much more effectively.

Also cc'ing @eriknw on this issue. He may have time and interest to take it on in the near future?

mrocklin (Contributor, Author) commented:

My benchmarking script:

import struct
import time

import numpy as np
from tornado.tcpserver import TCPServer
from tornado.tcpclient import TCPClient
from tornado.ioloop import IOLoop
from tornado import gen


@gen.coroutine
def read(stream):
    # Length-prefixed framing: an 8-byte size header followed by the payload.
    nbytes = yield stream.read_bytes(8)
    nbytes = struct.unpack('Q', nbytes)[0]  # 'Q' is 8 bytes everywhere; native 'L' is not
    data = yield stream.read_bytes(nbytes)
    return data


@gen.coroutine
def write(stream, msg):
    yield stream.write(struct.pack('Q', len(msg)))
    yield stream.write(msg)


class MyServer(TCPServer):
    @gen.coroutine
    def handle_stream(self, stream, address):
        # Echo server: read one frame and send it straight back.
        data = yield read(stream)
        print('server', len(data))
        yield write(stream, data)


@gen.coroutine
def f():
    data = bytes(np.random.randint(0, 255, dtype='u1', size=100000000).data)  # 100 MB payload
    server = MyServer()
    server.listen(8000)

    client = TCPClient()
    for i in range(5):
        stream = yield client.connect('127.0.0.1', 8000, max_buffer_size=int(1e9))
        yield write(stream, data)
        msg = yield read(stream)
    print(len(msg))


if __name__ == '__main__':
    start = time.time()
    IOLoop().run_sync(f)
    end = time.time()
    print(end - start)


# To profile:
#   python -m cProfile -o prof.out server.py
#   snakeviz prof.out

https://gist.github.com/73da550d7ed242b77282cb944a15a79c

pitrou (Contributor) commented Oct 17, 2017

Here is a slightly updated benchmark so that you don't have to install Numpy: https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0

pitrou (Contributor) commented Oct 17, 2017

There seems to be a 20%+ regression on this benchmark due to commit 992a00e.

pitrou (Contributor) commented Oct 17, 2017

I have a proof-of-concept at #2166. The main drawback is the added constant overhead due to more complicated buffer management written in pure Python. But on large data it's 45% faster. On the benchmark above, 50% of the time is spent in I/O (recv and send) and 22% of the time is spent concatenating buffers in read() (the obligatory b''.join).
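
The b''.join cost can be avoided by reading straight into a single preallocated buffer; a minimal sketch of that idea with a plain blocking socket (not the actual code in #2166):

def read_exactly(sock, nbytes):
    """Read exactly nbytes into one preallocated buffer; no b''.join required."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    pos = 0
    while pos < nbytes:
        n = sock.recv_into(view[pos:])  # the kernel writes directly into buf
        if n == 0:
            raise EOFError("connection closed before the full payload arrived")
        pos += n
    return buf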

pitrou (Contributor) commented Oct 17, 2017

@bdarnell

> The read side is trickier - even though I prefer the ergonomics of the IOStream interface, the Twisted-style Protocol interface does have advantages for read performance.

I believe you're right. Will it at some point be easy to plug the Protocol interface into Tornado (e.g. in combination with TCPClient or something similar)?

Edit: or perhaps I can simply use loop.asyncio_loop.create_connection and loop.asyncio_loop.create_server?

> In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)

Are there issues on Windows?

pitrou (Contributor) commented Oct 18, 2017

I wrote an asyncio variant of the benchmark script at https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
You may be surprised by the results: the plain asyncio version comes out noticeably slower. This is because asyncio does its own buffering when writing, and that buffering doesn't seem optimized for extremely large data.

Edit: I gave uvloop a try. It does speed up the asyncio benchmark by 40%, but it is still ~10% slower than #2166.
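
For reference, uvloop is normally enabled by installing its event loop policy before any loop is created; presumably something along these lines was used for the numbers above:

import asyncio

import uvloop

# All event loops created from here on will be uvloop-backed.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())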

pitrou (Contributor) commented Oct 18, 2017

I wrote about my current findings here:
https://mail.python.org/pipermail/async-sig/2017-October/000392.html

mrocklin (Contributor, Author) commented Oct 18, 2017 via email

bdarnell (Member) commented:

> I believe you're right. Will it at some point be easy to plug the Protocol interface into Tornado (e.g. in combination with TCPClient or something similar)?
>
> Edit: or perhaps I can simply use loop.asyncio_loop.create_connection and loop.asyncio_loop.create_server?

I think accessing the asyncio loop and its create_connection/create_server methods is the way to go for a Protocol-style interface. I don't have any other plans to introduce this interface in Tornado.
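
A minimal sketch of that route, assuming Tornado is running on top of asyncio so that IOLoop.current().asyncio_loop is available (the echo protocol, host, and port are placeholders):

import asyncio

from tornado.ioloop import IOLoop


class EchoProtocol(asyncio.Protocol):
    """Twisted-style protocol: asyncio pushes chunks in as they arrive."""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)


async def start_server():
    aio_loop = IOLoop.current().asyncio_loop    # the underlying asyncio event loop
    await aio_loop.create_server(EchoProtocol, '127.0.0.1', 8000)


if __name__ == '__main__':
    IOLoop.current().run_sync(start_server)     # bind the listening socket
    IOLoop.current().start()                    # then keep serving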

> In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)
>
> Are there issues on Windows?

I'm not expecting any (additional) issues on Windows because asyncio defaults to SelectorEventLoop instead of ProactorEventLoop, so it's going to be pretty similar to what we're already doing there.

bdarnell (Member) commented Jan 1, 2019

Tornado 5.0 introduced the read_into method and reduced copying in the write path. I think we've done as much as we're going to do for this, so I'm closing this issue.
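
For anyone finding this later, a small sketch of what the read_into path can look like, using the same length-prefixed framing as the benchmark above (the 8-byte 'Q' header is an assumption carried over from that script, not a Tornado convention):

import struct

from tornado import gen


@gen.coroutine
def read_frame(stream):
    header = yield stream.read_bytes(8)
    nbytes = struct.unpack('Q', header)[0]
    buf = bytearray(nbytes)        # preallocate the destination once
    yield stream.read_into(buf)    # Tornado 5.0+: fills buf in place, no b''.join
    return buf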

bdarnell closed this as completed Jan 1, 2019

dimaqq commented Mar 8, 2019

My 2c:

  • if there's TLS, it doesn't matter
  • if there's HTTP/2, I think (correct me if I'm wrong) it doesn't matter because the frame size is capped at 16 MB - 1 (2^24 - 1 bytes)
  • this leaves: HTTP 1.0, HTTP/1.1, HTTP/1.1 chunked, other app protocols, raw TCP, other network protocols

I'm not sure at what level massive transfers are best implemented...
