
Zero copy send #2147

Closed · mrocklin opened this issue Sep 6, 2017 · 18 comments

mrocklin (Contributor) commented Sep 6, 2017

It would be convenient to reduce CPU use when sending large memoryviews of data. I've brought this up a couple of times before, and it has also been raised by @bryevdv for websockets.

Previously I've closed my issues saying that this wasn't yet a bottleneck for my applications (Dask). However, several users are now running Dask on HPC systems with fast (multi-GB/s) interconnects, and it has become a bottleneck, so I'd like to revisit the issue.

Are there any objections or known challenges (other than developer time, of course) to handling memoryviews all the way from user input to the system call? I'm able to spend some cycles on this problem, but I'd like to verify that it's feasible and check in with the core devs in case there is anything I should be aware of before starting.
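
For concreteness, a minimal sketch of the usage pattern in question (the array, its size, and the variable names are illustrative, not an existing API):

import numpy as np

# A large contiguous buffer produced by the application, e.g. a NumPy array.
arr = np.random.randint(0, 255, dtype='u1', size=100000000)
buf = memoryview(arr)   # zero-copy view over the array's memory

# The hope is that a call like stream.write(buf), with the stream obtained
# from TCPClient.connect, could hand buf (or slices of it) straight to the
# socket's send() instead of first copying it into a bytes-based send buffer.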

mrocklin (Contributor, Author) commented Sep 6, 2017

I am also interested in accepting large amounts of data without excessive copying.

mrocklin (Contributor, Author) commented Sep 6, 2017

Relevant application issue here: pangeo-data/pangeo#6

bdarnell (Member) commented Sep 8, 2017

No objection in principle. The main thing to watch out for is that we not disadvantage the common case to make the extreme cases faster. I don't think that's too hard on the sending side: the common case is for each call to IOStream.write to translate to one syscall, so we just need to try that syscall first before we try to put anything into the send buffer. It'll make that code a little bit more complex, but nothing too bad.
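
A rough sketch of that write-side idea, assuming a non-blocking socket (the names and the error handling are illustrative, not Tornado's actual internals):

from collections import deque

# Per-connection queue of memoryview slices that still have to be sent.
pending = deque()


def write(sock, data):
    """Try the send() syscall first; buffer only whatever the kernel refuses."""
    view = memoryview(data)
    if not pending:                 # common case: nothing queued yet
        try:
            sent = sock.send(view)  # one syscall, straight from the caller's buffer
        except BlockingIOError:     # the socket is assumed to be non-blocking
            sent = 0
        view = view[sent:]          # slicing a memoryview does not copy
    if view.nbytes:
        pending.append(view)        # only the unsent tail gets buffered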

The read side is trickier - even though I prefer the ergonomics of the IOStream interface, the Twisted-style Protocol interface does have advantages for read performance. More dramatic changes may be needed here. If the changes get big enough it may be worth thinking about building more directly on the asyncio primitives (You're on python 3, right? In case you haven't been following along, I've begun the long process of phasing out Tornado's IOLoop in favor of the asyncio event loop. In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)).

bryevdv commented Sep 8, 2017

FYI I have just implemented a binary array protocol (for sending arrays only, from server to client) for Bokeh using array.tobytes. Early evaluation points to a significant improvement, but for array sizes of 1000x1000 it's not quite interactive when scrubbing a slider to update. I'd definitely be happy to see whether fewer copies (for WS sends) makes a difference.

mrocklin (Contributor, Author) commented Sep 8, 2017

> No objection in principle. The main thing to watch out for is that we not disadvantage the common case to make the extreme cases faster. I don't think that's too hard on the sending side: the common case is for each call to IOStream.write to translate to one syscall, so we just need to try that syscall first before we try to put anything into the send buffer. It'll make that code a little bit more complex, but nothing too bad.

We'll still need to handle buffering of some sort, I suspect. I doubt that the system call will ever accept all of the data, so we'll need to maintain write buffers, but hopefully those write buffers can contain memoryview objects.
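
A sketch of how such a memoryview-based write buffer might be drained once the socket becomes writable again (illustrative only, not actual IOStream code; pending is assumed to be a deque of memoryview slices as in the sketch above):

def drain(sock, pending):
    """Flush queued memoryview slices; called when the socket is writable again."""
    while pending:
        view = pending[0]
        try:
            sent = sock.send(view)
        except BlockingIOError:
            return                    # kernel buffer is full; wait for the next writable event
        if sent < view.nbytes:
            pending[0] = view[sent:]  # keep only the unsent tail, still without copying
            return
        pending.popleft()             # this chunk is fully written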

I haven't looked into read performance yet. I appreciate the warning that it might be more challenging.

> You're on python 3, right?

Dask supports both, but I don't mind doing Python 3 only for advanced features and extra benefits (>1GB/s bandwidth certainly counts).

> In case you haven't been following along, I've begun the long process of phasing out Tornado's IOLoop in favor of the asyncio event loop. In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3).

This raises two questions:

  1. How long will Tornado's IOStream stay around? It would be unfortunate to invest time optimizing this implementation only to have it be replaced by an AsyncIO implementation.
  2. When is the next expected release? This doesn't need to be very soon for my use cases, but it seems like the next release might be 5.0 and that 5.0 might be a long time (possibly forever) away.

bdarnell (Member) commented Sep 9, 2017

  1. I doubt IOStream will ever go away completely. It'll remain for backwards compatibility if nothing else. However, it is possible that the upper levels of the stack (HTTPServer and friends) migrate some day to something other than IOStream (i.e. asyncio's native interfaces). This is unlikely to happen until after we drop support for python 2.
  2. The next release will be 5.0, and it's sooner than "never" but I can't put a precise time on it because I've had very little time for tornado recently. I think it'll probably be in 2018. Tornado 5.0 will use asyncio by default on python 3, but won't drop support for python 2. Dropping support for python 2 is in the "long time (possibly forever) away" category.

mrocklin (Contributor, Author) commented:

> I'd definitely be happy to see whether fewer copies (for WS sends) makes a difference.

@bryevdv we were chatting briefly about profiling. If you do manage to get a good profiling script it would be very interesting to see how much time is spent where. This would certainly allow us to target performance improvements much more effectively.

Also cc'ing @eriknw on this issue. He may have time and interest to take it on in the near future?

mrocklin (Contributor, Author) commented:

My benchmarking script:

import struct
import time

import numpy as np
from tornado.tcpserver import TCPServer
from tornado.tcpclient import TCPClient
from tornado.ioloop import IOLoop
from tornado import gen


@gen.coroutine
def read(stream):
    # Length-prefixed framing: an 8-byte size header followed by the payload.
    nbytes = yield stream.read_bytes(8)
    nbytes = struct.unpack('Q', nbytes)[0]  # 'Q' is 8 bytes everywhere; native 'L' is not
    data = yield stream.read_bytes(nbytes)
    return data


@gen.coroutine
def write(stream, msg):
    yield stream.write(struct.pack('Q', len(msg)))
    yield stream.write(msg)


class MyServer(TCPServer):
    @gen.coroutine
    def handle_stream(self, stream, address):
        # Echo server: read one frame and send it straight back.
        data = yield read(stream)
        print('server', len(data))
        yield write(stream, data)


@gen.coroutine
def f():
    data = bytes(np.random.randint(0, 255, dtype='u1', size=100000000).data)  # 100 MB payload
    server = MyServer()
    server.listen(8000)

    client = TCPClient()
    for i in range(5):
        stream = yield client.connect('127.0.0.1', 8000, max_buffer_size=int(1e9))
        yield write(stream, data)
        msg = yield read(stream)
    print(len(msg))


if __name__ == '__main__':
    start = time.time()
    IOLoop().run_sync(f)
    end = time.time()
    print(end - start)


# To profile:
#   python -m cProfile -o prof.out server.py
#   snakeviz prof.out

https://gist.github.com/73da550d7ed242b77282cb944a15a79c

pitrou (Contributor) commented Oct 17, 2017

Here is a slightly updated benchmark so that you don't have to install Numpy: https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0

pitrou (Contributor) commented Oct 17, 2017

There seems to be a 20%+ regression on this benchmark due to commit 992a00e.

pitrou (Contributor) commented Oct 17, 2017

I have a proof-of-concept at #2166. The main drawback is the added constant overhead due to more complicated buffer management written in pure Python. But on large data it's 45% faster. On the benchmark above, 50% of the time is spent in I/O (recv and send) and 22% of the time is spent concatenating buffers in read() (the obligatory b''.join).
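
The b''.join cost can be avoided by reading straight into a single preallocated buffer; a minimal sketch of that idea with a plain blocking socket (not the actual code in #2166):

def read_exactly(sock, nbytes):
    """Read exactly nbytes into one preallocated buffer; no b''.join required."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    pos = 0
    while pos < nbytes:
        n = sock.recv_into(view[pos:])  # the kernel writes directly into buf
        if n == 0:
            raise EOFError("connection closed before the full payload arrived")
        pos += n
    return buf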

pitrou (Contributor) commented Oct 17, 2017

@bdarnell

> The read side is trickier - even though I prefer the ergonomics of the IOStream interface, the Twisted-style Protocol interface does have advantages for read performance.

I believe you're right. Will it at some point be easy to plug the Protocol interface into Tornado (e.g. in combination with TCPClient or something similar)?

Edit: or perhaps I can simply use loop.asyncio_loop.create_connection and loop.asyncio_loop.create_server?

> In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)

Are there issues on Windows?

pitrou (Contributor) commented Oct 18, 2017

I wrote an asyncio variant of the benchmark script at https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
You may be surprised by the results: the plain asyncio version comes out noticeably slower. This is because asyncio does its own buffering when writing, and that buffering doesn't seem optimized for extremely large data.

Edit: I gave uvloop a try. It does speed up the asyncio benchmark by 40%, but it is still ~10% slower than #2166.
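
For reference, uvloop is normally enabled by installing its event loop policy before any loop is created; presumably something along these lines was used for the numbers above:

import asyncio

import uvloop

# All event loops created from here on will be uvloop-backed.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())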

pitrou (Contributor) commented Oct 18, 2017

I wrote about my current findings here:
https://mail.python.org/pipermail/async-sig/2017-October/000392.html

mrocklin (Contributor, Author) commented Oct 18, 2017 via email

bdarnell (Member) commented:

> I believe you're right. Will it at some point be easy to plug the Protocol interface into Tornado (e.g. in combination with TCPClient or something similar)?
>
> Edit: or perhaps I can simply use loop.asyncio_loop.create_connection and loop.asyncio_loop.create_server?

I think accessing the asyncio loop and its create_connection/create_server methods is the way to go for a Protocol-style interface. I don't have any other plans to introduce this interface in Tornado.
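
A minimal sketch of that route, assuming Tornado is running on top of asyncio so that IOLoop.current().asyncio_loop is available (the echo protocol, host, and port are placeholders):

import asyncio

from tornado.ioloop import IOLoop


class EchoProtocol(asyncio.Protocol):
    """Twisted-style protocol: asyncio pushes chunks in as they arrive."""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)


async def start_server():
    aio_loop = IOLoop.current().asyncio_loop    # the underlying asyncio event loop
    await aio_loop.create_server(EchoProtocol, '127.0.0.1', 8000)


if __name__ == '__main__':
    IOLoop.current().run_sync(start_server)     # bind the listening socket
    IOLoop.current().start()                    # then keep serving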

> In tornado 5.0 AsyncIOLoop will be the default (and perhaps the only choice on python 3)
>
> Are there issues on Windows?

I'm not expecting any (additional) issues on Windows because asyncio defaults to SelectorEventLoop instead of ProactorEventLoop, so it's going to be pretty similar to what we're already doing there.

bdarnell (Member) commented Jan 1, 2019

Tornado 5.0 introduced the read_into method and reduced copying in the write path. I think we've done as much as we're going to do for this, so I'm closing this issue.
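
For anyone finding this later, a small sketch of what the read_into path can look like, using the same length-prefixed framing as the benchmark above (the 8-byte 'Q' header is an assumption carried over from that script, not a Tornado convention):

import struct

from tornado import gen


@gen.coroutine
def read_frame(stream):
    header = yield stream.read_bytes(8)
    nbytes = struct.unpack('Q', header)[0]
    buf = bytearray(nbytes)        # preallocate the destination once
    yield stream.read_into(buf)    # Tornado 5.0+: fills buf in place, no b''.join
    return buf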

bdarnell closed this as completed Jan 1, 2019

dimaqq commented Mar 8, 2019

My 2c:

  • if there's TLS, it doesn't matter
  • if there's HTTP/2, I think (correct me if I'm wrong) it doesn't matter because the frame size is capped at 16 MB - 1 (2^24 - 1 bytes)
  • this leaves: HTTP 1.0, HTTP/1.1, HTTP/1.1 chunked, other app protocols, raw TCP, other network protocols

I'm not sure at what level massive transfers are best implemented...
