
Support Pickle's protocol 5 #3784

Merged 18 commits from support_pickle_protocol_5 on May 21, 2020

Conversation

@jakirkham (Member)

This ensures that objects that support Pickle's protocol 5 go through out-of-band serialization of buffers. This should avoid some of the extra memory usage and memory copies that would otherwise be incurred when going through Pickle's older protocols, which lack this feature. Requires Python 3.8+ (the version in which this protocol was introduced).

I had considered supporting the pickle5 backport as well, so that users could get this functionality with earlier versions of Python. However, that would involve supporting pickle5 in cloudpickle, which hasn't been done yet ( cloudpipe/cloudpickle#179 ), so I'm sidestepping it for now. It could be handled in a follow-up once that is addressed.
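For reference, here is a minimal sketch of what out-of-band pickling with protocol 5 looks like on Python 3.8+ (illustrative only, not the code in this PR), using a NumPy array as the buffer-exposing object:

import pickle

import numpy as np

# With buffer_callback, large buffers are handed out separately instead of
# being copied into the pickle stream, then passed back to loads via buffers.
data = np.arange(1_000_000)
buffers = []
frame = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)
roundtripped = pickle.loads(frame, buffers=buffers)
assert (roundtripped == data).all()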

@jcrist (Member) left a comment


This is great to see. I think sticking with the CPython implementation only for now (instead of also supporting pickle5) makes sense; as people upgrade Python versions, the need for pickle5 will go away.

(Review thread on distributed/protocol/pickle.py, resolved)
@jakirkham force-pushed the support_pickle_protocol_5 branch 4 times, most recently from e30341e to d8558c2 on May 7, 2020 at 20:03
@jakirkham (Member Author)

Please let me know if anything else is needed here 🙂

(Review thread on distributed/protocol/pickle.py, resolved)
Before calling the user-provided buffer callback, collect buffers in an internal list. That way, if the pickling mechanism needs to change, the internal list can be purged before handing the buffers to the user. At the end of pickling, make sure the user's buffer callback is called on each buffer in order.
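A rough sketch of that idea (illustrative only; the function name and the protocol-4 fallback here are assumptions, not the actual distributed/protocol/pickle.py code):

import pickle

def dumps_sketch(x, *, buffer_callback=None, protocol=pickle.HIGHEST_PROTOCOL):
    # Collect buffers in an internal list first.
    internal = []
    cb = internal.append if protocol >= 5 else None
    try:
        frame = pickle.dumps(x, protocol=protocol, buffer_callback=cb)
    except Exception:
        # If the pickling mechanism has to change, purge what was collected
        # so the caller never sees buffers from the abandoned attempt.
        internal.clear()
        frame = pickle.dumps(x, protocol=4)
    # Only at the end, call the user's buffer callback on each buffer in order.
    if buffer_callback is not None:
        for buf in internal:
            buffer_callback(buf)
    return frame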
@jakirkham (Member Author)

Planning on merging tomorrow if no comments.

@jcrist (Member) commented May 19, 2020

Could you wait a bit? I still want to review this, just a bit swamped right now.

@jakirkham (Member Author)

Sure. Do you have a (rough) sense of when you might be able to look at it (for my own planning)?

@jcrist (Member) commented May 19, 2020

Probably sometime tomorrow.

@jcrist (Member) commented May 29, 2020

Hmmm, that is concerning. Can you reproduce it on a local cluster?

from dask.distributed import Client
client = Client()
client.submit(lambda: "test function").result()

@jakirkham (Member Author)

You might need to use a NumPy array and force pickle for communication to create a reproducer. Alternatively, you could use the MemoryviewHolder class from the tests here. Something like that should generate a non-trivial number of buffers.
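For illustration, a class along these lines exposes its data out-of-band under protocol 5 (a hedged sketch in the spirit of that test helper, not the actual MemoryviewHolder code):

import pickle

class BufferHolder:  # illustrative stand-in, not the real MemoryviewHolder
    def __init__(self, data):
        self.data = memoryview(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Hand the underlying buffer to pickle out-of-band.
            return (BufferHolder, (pickle.PickleBuffer(self.data),))
        return (BufferHolder, (self.data.tobytes(),))

buffers = []
frame = pickle.dumps(BufferHolder(b"some bytes"), protocol=5,
                     buffer_callback=buffers.append)
obj = pickle.loads(frame, buffers=buffers)  # one buffer collected out-of-band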

@gshimansky

@jcrist Yes, I am getting back 'test function'.
This is a very subtle bug, and for some reason it is reproducible only in my test environment. But I am able to trigger it consistently, so I will try to find the reason.

@jakirkham (Member Author)

It would be good to start debugging with something where out-of-band pickling would be used. NumPy arrays would work, for example (though we have to force them to go through pickle). Alternatively, use something that implements pickle protocol 5 (like MemoryviewHolder) that Dask doesn't otherwise know how to serialize. Here's an example with NumPy:

import numpy as np
from distributed import Client

client = Client(serializers=["pickle"])
client.submit(lambda a, b: a + b, np.arange(3), 2).result()

This may serve as a good starting point for further debugging, though it may need some adjustment: using client.run to hit all workers, or only returning (or only passing) a NumPy array instead of doing both as above. Additionally, there may be other flags we need to set to ensure pickling happens (though using MemoryviewHolder should avoid that).

Hopefully that gives some ideas on how to narrow this down further.

@mrocklin (Member) commented May 29, 2020 via email

@jcrist (Member) commented May 29, 2020

Yes. Since pickle recurses, anything with a __reduce__ implementation that returns NumPy arrays (or other objects that directly support protocol 5) will also use the new protocol. No special-casing in pandas is needed.
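For example (an illustrative sketch, not pandas code), a plain container whose __reduce__ returns a NumPy array picks up out-of-band buffers automatically:

import pickle
import numpy as np

class Wrapper:  # hypothetical container, for illustration only
    def __init__(self, arr):
        self.arr = arr

    def __reduce__(self):
        return (Wrapper, (self.arr,))

buffers = []
frame = pickle.dumps(Wrapper(np.arange(5)), protocol=5,
                     buffer_callback=buffers.append)
# The array's data is collected out-of-band even though Wrapper itself
# knows nothing about protocol 5 (one buffer for a contiguous array).
obj = pickle.loads(frame, buffers=buffers)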

@mrocklin (Member) commented May 29, 2020 via email

@jakirkham (Member Author)

Yeah, this is what @TomAugspurger and I were learning in issue ( pandas-dev/pandas#34244 ). I also brought this up in comment ( #614 (comment) ). It would be good to have some tests in pandas that cover pickling with protocol 5 as well, though based on local exploration this does seem to serialize pandas objects efficiently.

While more efficient serialization of pandas objects is an immediate, impactful win, I think the bigger win is that any object supporting this protocol is now serialized efficiently, which lowers the bar for getting efficient serialization with Dask. We've done this sort of work for RAPIDS, for example, and other libraries like spaCy may do it as well ( explosion/spaCy#5472 ). It might be nice to identify some other pain points like this one; they should now be low-hanging fruit.

Additionally, it would be nice to cover the last mile of pre-3.8 Python support through the pickle5 backport package. This would allow Dask to have efficient serialization on any supported Python version. I did some work to make cloudpickle compatible with pickle5 in PR ( cloudpipe/cloudpickle#370 ), thanks to help from @pierreglaser, but am stuck on a couple of remaining test failures (if anyone has thoughts there 😉). I expect the changes for pickle5 support in Dask to be much simpler once cloudpickle supports it.
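A rough sketch of the import fallback this would enable (assuming the pickle5 package is installed on older Pythons; it mirrors the 3.8 stdlib API):

import sys

if sys.version_info >= (3, 8):
    import pickle
else:
    import pickle5 as pickle  # backport of the protocol 5 pickle module

# Either way, protocol 5 and the buffers/buffer_callback arguments are available.
assert pickle.HIGHEST_PROTOCOL >= 5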

Once we wrap that up, it would be good to share this work more openly. I was thinking of writing a blog post to discuss this, along the lines of your "Pickle isn't slow, it's a protocol" blog post. It would be nice to point the community to this avenue for efficient serialization in distributed computing (whether or not they use Dask). In fact, it seems some people are already taking notice ( ray-project/ray#8577 ) 😉

@jakirkham mentioned this pull request on May 29, 2020
@gshimansky

Well, I am actually experiencing the problem when I run tests on the Modin project ( https://github.com/modin-project/modin ), and pandas DataFrames and Series are the objects that are passed to Dask workers.

@jcrist (Member) commented May 29, 2020

Are you able to reproduce without modin?

@gshimansky commented May 29, 2020

Are you able to reproduce without modin?

Not yet. It is quite fragile and easy to lose when making changes.

@jakirkham (Member Author)

@gshimansky, have you tried my suggestion above?

@gshimansky

I tried to debug the distributed code. So far I have found that the HIGHEST_PROTOCOL value is always 4. I also found that pickle_loads gets called with frames of length 2, which results in buffers being non-empty. The frames of length 2 come from deserialize_bytes, but there the trace is lost, because I don't know whether decompress should return such a value or not.

@jakirkham (Member Author)

Maybe Dask is splitting up the frames somehow when serializing and not joining them correctly when deserializing? That's my best guess. Though an example would really help (in addition to being a useful test case for a fix 😉)

@gshimansky

My question is: how should this work correctly when HIGHEST_PROTOCOL equals 4 and loads therefore doesn't have a buffers argument? If the frames argument arrives with length 2, buffers ends up non-empty, and loads gets called with a buffers keyword it doesn't support.

@jakirkham (Member Author)

We don't collect buffers when the protocol is 4, so it should be empty.

However, Dask sometimes splits up frames for various reasons (networking, limits on compression, etc.), independently of whether pickling was used. If Dask didn't join them back together correctly, that might present as the bug you are seeing.

Again, we would need a minimal reproducible example (MRE) to confirm that though.
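To make that concrete, here is a standalone sketch (not the actual distributed code paths) of why un-joined split frames break a protocol 4 payload:

import pickle

payload = pickle.dumps(list(range(1000)), protocol=4)

# Suppose the transport layer splits the single pickled frame into chunks.
chunks = [payload[:100], payload[100:]]

# Correct behavior: re-join the chunks before unpickling.
assert pickle.loads(b"".join(chunks)) == list(range(1000))

# Incorrect behavior: treating chunks[1:] as out-of-band buffers leaves
# chunks[0] as a truncated pickle stream, and loads() fails on it.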

@gshimansky

OK, so do you mean that if the protocol is 4, then decompress in deserialize_bytes should not return a frames list of length greater than 1? I need to know whether this behavior is allowed or not in order to investigate further.



 def pickle_loads(header, frames):
-    return pickle.loads(b"".join(frames))
+    x, buffers = frames[0], frames[1:]
@jakirkham (Member Author)


My expectation is that frames[1:] would be an empty list here. It might not be though. If it's not, that would be good to fix. Whether that has to do with deserialize_bytes behavior or not, I'm less sure. Maybe someone else knows?

@gshimansky

The frames in pickle_loads come from deserialize_bytes for me (take a look at the exception stack trace), and for me frames[1:] is not empty. That's why I am asking whether this is expected or not.

@jakirkham (Member Author)


Right. So do we have a reproducer? That should help us debug this further.

@jakirkham (Member Author)


FWIW, I've been trying to reproduce this too, but haven't had any luck. I believe you that there could be a problem here; it's just tricky to come up with fixes in the dark, so anything you can come up with would help 🙂

@gshimansky commented May 29, 2020


If you say that Dask is able to split frames, could that happen when large pandas DataFrames are serialized? My test is a benchmark that operates on such objects, about 20 million records long and 40 columns wide. Modin splits them for parallel processing, but they still remain quite large in memory.

@jakirkham (Member Author)


Maybe something like PR ( #3841 ) will help? We still lack a test case though, so I have no way to confirm whether it actually helps (or to ensure we don't accidentally regress).

(Member)


Even when sending large bytes that do get split by frame_split_size, I'm unable to reproduce an error. Things are properly reassembled on the other side. Not sure what's going on there.
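Roughly the kind of check attempted (a sketch; the payload size needed to trigger frame splitting is an assumption about the default threshold):

from dask.distributed import Client

client = Client(serializers=["pickle"], deserializers=["pickle"])
big = b"x" * (256 * 2**20)  # large bytes payload, expected to be split into frames
assert client.submit(len, big).result() == len(big)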

@jakirkham (Member Author)


Yeah me neither unfortunately. Happy to dig deeper once we identify a reproducer 🙂

@jakirkham (Member Author)


A similar issue was raised recently ( #3851 ). Not sure whether it is the same. I suspect PR ( #3639 ) fixes this. It would be good if you could try it, though.

@jakirkham (Member Author)


This was fixed by PR ( #3639 ).

@jakirkham (Member Author)

This has been extended to Python versions before 3.8 using pickle5 in PR ( #3849 ).
