
Serialize and split #4541

Merged — 12 commits merged into dask:master on Feb 26, 2021
Conversation

@madsbk madsbk (Contributor) commented Feb 23, 2021

Simplify the serialization, splitting, and writability of objects.

This work is a precursor to #4531 that makes it possible to have msgpack extract serializable objects while supporting splitting and maintaining the writability of objects.

  • Tests added / passed
  • Passes black distributed / flake8 distributed
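
For illustration, a rough sketch of the idea (none of this is the code in this PR; sketch_dumps, sketch_loads, and the "writeable" key are made up for the example): an object is split into a small msgpack-friendly header plus a list of frames, and the header records which frames must stay writeable so that writability can be restored on the receiving side.

import pickle

def sketch_dumps(obj):
    # Split obj into a small pickled payload plus out-of-band buffers (frames).
    buffers = []
    data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    frames = [data] + [b.raw() for b in buffers]
    header = {
        "serializer": "pickle",
        # Record per-frame writability so the receiver can restore it.
        "writeable": tuple(not memoryview(f).readonly for f in frames[1:]),
    }
    return header, frames

def sketch_loads(header, frames):
    # Copy any frame that must be writeable but arrived read-only.
    restored = [
        bytearray(f) if needed and memoryview(f).readonly else f
        for needed, f in zip(header["writeable"], frames[1:])
    ]
    return pickle.loads(frames[0], buffers=restored)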

@madsbk madsbk marked this pull request as ready for review February 23, 2021 17:21
@jakirkham jakirkham (Member) left a comment

Thanks Mads! 😄

Had a few comments below

Comment on lines 64 to 67
header = {
"serializer": "pickle",
"pickle-writeable": tuple(not f.readonly for f in frames[1:]),
}
jakirkham (Member)

Should we do something similar in dask_dumps and cuda_dumps?

madsbk (Contributor, PR author)

I am not sure whether we want to do that here or in the individual registered dumps/loads functions, like the NumPy serialization does.
Anyway, I don't think it should block this PR.

jakirkham (Member)

Yeah, it's a good question. I think support for NumPy arrays is a bit older, since it is a primary use case, so that function may just be a bit unusual because of that.

We should be OK pulling this out of the NumPy case and handling it generally. I would think that should yield simpler, easier-to-understand code, but I could be wrong about that.

For context, tracking writeable frames was needed to solve some gnarly issues (#1978, #3943). So if there is a general way to solve this, that would be ideal to ensure they don't resurface.
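
For illustration only, the generalization being suggested might look roughly like this; record_writeable is a hypothetical helper, not a function added by this PR:

def record_writeable(header, frames, dont_care=False):
    # Hypothetical helper each *_dumps path (pickle_dumps, dask_dumps,
    # cuda_dumps, ...) could call, so the writability bookkeeping is no
    # longer special-cased inside the NumPy serializer.
    if dont_care:
        # e.g. device-bound buffers, where writability of the host copy is irrelevant
        header["writeable"] = (None,) * len(frames)
    else:
        header["writeable"] = tuple(not memoryview(f).readonly for f in frames)
    return header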

madsbk (Contributor, PR author)

I agree, but let's do that in a follow-up PR.
It assumes that dask_dumps returns a memoryview-compatible object, is that right?
Also, we apparently allow additional frames when deserializing: https://github.com/dask/distributed/blob/master/distributed/protocol/tests/test_serialize.py#L82

jakirkham (Member)

Sure, sounds good 🙂

Yeah, though I think that is pretty closely enforced today.

I think that is just showing we ignore empty frames, but I could be missing something.

distributed/protocol/numpy.py (review thread outdated, resolved)
distributed/protocol/serialize.py (review thread outdated, resolved)
@@ -31,7 +30,6 @@ def cuda_deserialize_rmm_device_buffer(header, frames):
@dask_serialize.register(rmm.DeviceBuffer)
def dask_serialize_rmm_device_buffer(x):
    header, frames = cuda_serialize_rmm_device_buffer(x)
    header["writeable"] = (None,) * len(frames)
jakirkham (Member)

Just wanted to note that None has a special meaning here. It basically means it doesn't matter whether this frame is read-only or writeable; in other words, skip trying to copy it. The reason we include this (in particular on the Dask serialization path) is to avoid an extra copy of buffers we plan to move to device later.

That said, I think the changes here may already capture this use case. Just wanted to surface the logic to hopefully clarify what is going on currently and catch any remaining things not yet addressed.
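
As a small sketch of the flag semantics described above (not the actual code in serialize.py):

def restore_frame(flag, frame):
    # True  -> the frame must end up writeable, so copy it if it arrived read-only
    # False -> the frame does not need to be writeable, leave it alone
    # None  -> writability is irrelevant (e.g. bytes about to be moved to device
    #          memory), so never spend an extra copy on it
    if flag and memoryview(frame).readonly:
        return bytearray(frame)
    return frame

In this simplified form False and None happen to behave the same; the point of None on the Dask serialization path above is to state explicitly that no host-side copy should ever be made for buffers that are headed to the device.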

@jakirkham (Member)

Have we tried running the CUDA tests locally as well?

@madsbk madsbk (Contributor, PR author) commented Feb 26, 2021

Have we tried running the CUDA tests locally as well?

Yes, they are all passing on my laptop :)

@jakirkham jakirkham merged commit 7f8bb81 into dask:master Feb 26, 2021
@jakirkham (Member)

Thanks Mads! 😄

@madsbk madsbk deleted the serialize_and_split branch March 1, 2021 08:03
@jakirkham jakirkham mentioned this pull request Feb 1, 2022