Zero copy send #2147
I am also interested in accepting large amounts of data without excessive copying.
Relevant application issue here: pangeo-data/pangeo#6
No objection in principle. The main thing to watch out for is that we not disadvantage the common case to make the extreme cases faster. I don't think that's too hard on the sending side: the common case is for each call to `write` […].

The read side is trickier: even though I prefer the ergonomics of the IOStream interface, the Twisted-style Protocol interface does have advantages for read performance. More dramatic changes may be needed here. If the changes get big enough it may be worth thinking about building more directly on the asyncio primitives. (You're on Python 3, right? In case you haven't been following along, I've begun the long process of phasing out Tornado's IOLoop in favor of the asyncio event loop. In Tornado 5.0 AsyncIOLoop will be the default, and perhaps the only choice on Python 3.)
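For context, a minimal sketch of the Twisted-style Protocol interface as exposed by asyncio (the echo behavior and port number are arbitrary): each `data_received` call hands the protocol whatever chunk the last read produced, with no stream-level merge buffer in between.

```python
import asyncio


class EchoProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Chunks arrive exactly as read off the socket; the protocol
        # decides whether to accumulate, parse in place, or hand off.
        self.transport.write(data)


loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    loop.create_server(EchoProtocol, '127.0.0.1', 8000))
loop.run_forever()
```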
FYI I have just implemented a binary array protocol (for sending arrays only, from server to client) for Bokeh using […].
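The shape of such a protocol on Tornado websockets looks roughly like the following. This is a hypothetical handler for illustration, not Bokeh's actual code: a small JSON header describes the array, and the payload goes out as a binary frame, avoiding base64/text encoding.

```python
import numpy as np
from tornado import websocket


class ArrayHandler(websocket.WebSocketHandler):
    def send_array(self, a):
        # Header first (tornado JSON-encodes dict messages), then the
        # raw buffer as a single binary websocket frame.
        self.write_message({'dtype': str(a.dtype), 'shape': list(a.shape)})
        self.write_message(a.tobytes(), binary=True)
```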
We'll still need to handle buffering of some sort, I suspect; I doubt that the system call will ever accept all of the data. We'll need to maintain write buffers, but hopefully those write buffers can contain memoryviews. I haven't looked into read performance yet. I appreciate the warning that it might be more challenging.
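A sketch of the hoped-for write-buffer shape (class and method names hypothetical): queue outgoing chunks as memoryviews and re-slice the front one on partial sends, so a short `send` never forces a copy of the unsent remainder.

```python
import collections


class ZeroCopyWriteBuffer:
    def __init__(self):
        self._chunks = collections.deque()

    def write(self, data):
        self._chunks.append(memoryview(data))  # a view, not a copy

    def flush_some(self, sock):
        # Push as much as the kernel will take right now.
        while self._chunks:
            chunk = self._chunks[0]
            n = sock.send(chunk)
            if n == len(chunk):
                self._chunks.popleft()
            else:
                # O(1) re-slice; the unsent tail is never copied.
                self._chunks[0] = chunk[n:]
                break
```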
Dask supports both Python 2 and 3, but I don't mind doing Python 3 only for advanced features and extra benefits (>1 GB/s bandwidth certainly counts).
This raises two questions: […]
@bryevdv we were chatting briefly about profiling. If you do manage to get a good profiling script it would be very interesting to see how much time is spent where. This would certainly allow us to target performance improvements much more effectively. Also cc'ing @eriknw on this issue. He may have time and interest to take it on in the near future?
My benchmarking script:

```python
import struct
import time

import numpy as np
from tornado.tcpserver import TCPServer
from tornado.tcpclient import TCPClient
from tornado.ioloop import IOLoop
from tornado import gen


@gen.coroutine
def read(stream):
    # Note: 'L' is the platform's native unsigned long, which is 8 bytes
    # on 64-bit Linux; the fixed read of 8 bytes assumes that layout.
    nbytes = yield stream.read_bytes(8)
    nbytes = struct.unpack('L', nbytes)[0]
    data = yield stream.read_bytes(nbytes)
    return data


@gen.coroutine
def write(stream, msg):
    yield stream.write(struct.pack('L', len(msg)))
    yield stream.write(msg)


class MyServer(TCPServer):
    @gen.coroutine
    def handle_stream(self, stream, address):
        data = yield read(stream)
        print('server', len(data))
        yield write(stream, data)


@gen.coroutine
def f():
    data = bytes(np.random.randint(0, 255, dtype='u1', size=100000000).data)  # 100 MB
    server = MyServer()
    server.listen(8000)
    client = TCPClient()
    for i in range(5):
        stream = yield client.connect('127.0.0.1', 8000, max_buffer_size=int(1e9))
        yield write(stream, data)
        msg = yield read(stream)
        print(len(msg))


if __name__ == '__main__':
    start = time.time()
    IOLoop().run_sync(f)
    end = time.time()
    print(end - start)

# To profile:
#   python -m cProfile -o prof.out server.py
#   snakeviz prof.out
```
Here is a slightly updated benchmark so that you don't have to install NumPy: https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
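The gist itself isn't reproduced here, but dropping the NumPy dependency presumably comes down to building the payload from the standard library, e.g.:

```python
import os

data = os.urandom(100 * 1000 * 1000)   # 100 MB of random bytes
# or, instantly and just as good for a copy benchmark:
data = b'x' * (100 * 1000 * 1000)
```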
There seems to be a 20%+ regression on this benchmark due to commit 992a00e.
I have a proof-of-concept at #2166. The main drawback is the added constant overhead due to more complicated buffer management written in pure Python, but on large data it's 45% faster. On the benchmark above, 50% of the time is spent in I/O (recv and send) and 22% of the time is spent concatenating buffers in read() (the obligatory `b"".join(...)`).
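To illustrate the direction that buffer-management work takes (a sketch in the same spirit, not the actual code in #2166): keeping one growable bytearray plus a read pointer means consuming n bytes copies only those n bytes, instead of re-joining everything buffered so far on every read.

```python
class ReadBuffer:
    def __init__(self):
        self._buf = bytearray()
        self._pos = 0

    def feed(self, data):
        self._buf += data                     # amortized append

    def read(self, n):
        # Copy out only the n bytes being consumed (bounds checks
        # omitted for brevity).
        start, self._pos = self._pos, self._pos + n
        chunk = bytes(memoryview(self._buf)[start:self._pos])
        if self._pos > len(self._buf) // 2:   # reclaim consumed space
            del self._buf[:self._pos]
            self._pos = 0
        return chunk
```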
I believe you're right. Will it at some point be easy to plug the Protocol interface into Tornado (e.g. in combination with TCPClient or something similar)? Edit: or perhaps I can simply use […].
Are there issues on Windows?
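With the asyncio integration this is already possible by reaching through to the asyncio loop directly; a sketch assuming Tornado's bridge (`tornado.platform.asyncio.AsyncIOMainLoop`) is installed and a server is listening on the (arbitrary) port:

```python
import asyncio
from tornado.platform.asyncio import AsyncIOMainLoop

AsyncIOMainLoop().install()   # share one loop between Tornado and asyncio


class Client(asyncio.Protocol):
    def data_received(self, data):
        print('got', len(data), 'bytes')


loop = asyncio.get_event_loop()
transport, protocol = loop.run_until_complete(
    loop.create_connection(Client, '127.0.0.1', 8000))
```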
I wrote an […]. This is because […].
Edit: I gave uvloop a try. It does speed up the […].
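For reference, trying uvloop requires no code changes beyond installing its loop policy before any loop is created:

```python
import asyncio
import uvloop

# Must run before anything creates an event loop.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
loop = asyncio.get_event_loop()   # now a uvloop-backed loop
```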
I wrote about my current findings here: https://mail.python.org/pipermail/async-sig/2017-October/000392.html
Nice writeup.
I think accessing the asyncio loop and its […].
I'm not expecting any (additional) issues on Windows because asyncio defaults to SelectorEventLoop instead of ProactorEventLoop, so it's going to be pretty similar to what we're already doing there.
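(A note for readers on newer Pythons: 3.8 changed the Windows default to the proactor loop, so reproducing the selector-based behavior described here takes an explicit policy.)

```python
import sys
import asyncio

if sys.platform == 'win32':
    # Restore the selector-based loop this discussion assumes;
    # Python 3.8+ defaults to ProactorEventLoop on Windows.
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```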
Tornado 5.0 introduced the `IOStream.read_into` method.
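Assuming that's the method in question, the zero-copy read pattern it enables looks roughly like this: preallocate once and let the stream fill the buffer in place.

```python
from tornado import gen


@gen.coroutine
def read_payload(stream, nbytes):
    buf = bytearray(nbytes)
    # Fills `buf` directly from the socket; no intermediate bytes
    # object is allocated and handed back for the caller to copy.
    yield stream.read_into(buf)
    return memoryview(buf)
```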
My 2c: […]
I'm no expert on the level at which massive transfers are best implemented...
It would be convenient to reduce CPU use when sending large memoryviews of data. I've brought this up a couple of times […], and it has been raised by @bryevdv for websockets […].
Previously I've closed my issues saying that this wasn't yet a bottleneck for my applications (Dask). However, several users are now running Dask on HPC systems with fast multi-GB interconnects, and it has become a bottleneck, so I'd like to revisit the issue.
Are there any objections or known challenges to handling memoryviews all the way from user input to system call, other than developer time of course? I'm able to spend some cycles on this problem, but I'd like to verify that it's feasible and check in with the core devs to see if there is anything I should be aware of before starting.
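For what it's worth, the destination end is already well defined: `socket.send` accepts any object exposing the buffer protocol, so the question is about preserving that property through the intermediate layers. A sketch of the endpoint:

```python
import socket


def send_all(sock, data):
    """Drain a buffer through send() without ever copying it:
    re-slicing a memoryview is O(1) and shares the same memory."""
    view = memoryview(data)
    while len(view):
        sent = sock.send(view)
        view = view[sent:]

# Works unchanged for bytes, bytearray, NumPy arrays, mmap objects, ...
```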