Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bindings/python): support pickle [de]serialization for Operator #5324

Merged
merged 13 commits into from
Nov 16, 2024

Conversation

TennyZhuang
Copy link
Contributor

Which issue does this PR close?

Partially finished #5316

Rationale for this change

Now an Operator can be serialized and deserialized via pickle, as a result, the Operator class can be used in multiprocessing context.

What changes are included in this PR?

In the PR, we store the arguments for Operator constructor, so that we can recover the Operator from pickle deserialization.

Are there any user-facing changes?

  • support pickle [de]serialization for Operator in python binding
  • support Operator to be used in multiprocessing context

@Zheaoli I'm not very familiar with Python binding, PTAL for the PR, thanks!

@TennyZhuang
Copy link
Contributor Author

It seems like we should add a new capability MultiProcessingAccess to guard the behavior.

"""
serialized = pickle.dumps(operator)
deserialized = pickle.loads(serialized)
assert repr(operator) == repr(deserialized)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, I think the ideal behavior for pickle [de]serialization is that the object has the same behavior/operation result.

So I suggest the test can be

data="abc"
operator.write("/abc","abc")
serialized = pickle.dumps(operator)
del operator
gc.collect()

deserialized = pickle.loads(serialized)
assert "abc"==deserialized.read("/abc")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

core: self.core.clone().into(),
__scheme: self.__scheme.clone(),
__map: self.__map.clone(),
})
}

fn __repr__(&self) -> String {
Copy link
Member

@Zheaoli Zheaoli Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need add extra __hash__ method to represent two object are equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to implement, we can't get internal information of Operator.

Copy link
Member

@Zheaoli Zheaoli Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about hash the scheme and the map?

repr is used for show what the object is looking at. Maybe not a good method for identify

Copy link
Contributor Author

@TennyZhuang TennyZhuang Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer don’t implement hash and eq for Operator, and remove the equal assertion based on repr.

(schema, map) can’t identify an Operator, leave eq and empty empty is better. Only the ID of PyObject can identify the Operator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer don’t implement hash and eq for Operator, and remove the equal assertion based on repr.

Agreed. ocore::Operator doesn't implement Eq ahd Hash too.

@Zheaoli
Copy link
Member

Zheaoli commented Nov 15, 2024

It seems like we should add a new capability MultiProcessingAccess to guard the behavior.

SGTM, maybe Pickable would be better?

@TennyZhuang
Copy link
Contributor Author

TennyZhuang commented Nov 16, 2024

Pickable

IMO all operators can be serialized as pickle.

In most Python libraries, all objects can be serialized as pickle, but it may not be identical when deserialized. The behavior is acceptable.

We need the capability only for tests. We can ignore backends which doesn't support the capability in the test.

@Xuanwo
Copy link
Member

Xuanwo commented Nov 16, 2024

I'm considering adding a shared capability to indicate whether this storage service can be shared across different processes or even different instances.

Based on this definition:

  • memory is not shared: it cannot be accessed by other processes.
  • sled is not shared: it has a unique lock that only one process can use.
  • fs is shared: different processes can access the same fs as long as they have it mounted.
  • s3 is shared: obviously.

I believe this capability is better than the following:

  • multi_processing_access: too long and doesn't align with exstsing capability naming style.
  • pickable: python only and doesn't reflect the root cause.

I don't like the idea that:

We need the capability only for tests

I view it as a design failure for OpenDAL to include something in the public API solely for testing purposes. shared is somewhat useful for users who want to understand what will happen if they try to share the same storage services across processes or instances.

@Zheaoli
Copy link
Member

Zheaoli commented Nov 16, 2024

I'm considering adding a shared capability to indicate whether this storage service can be shared across different processes or even different instances.

SGTM

@Xuanwo Xuanwo requested a review from Copilot November 16, 2024 12:53

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 4 out of 4 changed files in this pull request and generated no suggestions.

Comments skipped due to low confidence (2)

bindings/python/src/operator.rs:230

  • The method __getnewargs_ex__ might not correctly handle the conversion of kwargs to a Python dictionary. Ensure that kwargs are correctly converted to a Python dictionary.
fn __getnewargs_ex__(&self, py: Python) -> PyResult<PyObject> {

bindings/python/src/operator.rs:51

  • [nitpick] The naming of the private attributes __scheme and __map should be consistent with the convention used in the rest of the codebase.
__scheme: ocore::Scheme,
@TennyZhuang
Copy link
Contributor Author

I don't like the idea that:

Please note that pickle doesn’t mean shared. The user can pickle the operator, close the operator, pass the bytes to another process and unpickle the operator. So we should implement pickle for every operator, both shared and unique.

What we only can do is disable the tests for unique operator.

@Xuanwo
Copy link
Member

Xuanwo commented Nov 16, 2024

Please note that pickle doesn’t mean shared. The user can pickle the operator, close the operator, pass the bytes to another process and unpickle the operator. So we should implement pickle for every operator, both shared and unique.

That's correct. As you mentioned, we just need to avoid testing it in a shared manner (which implies that we can access the same storage service after unpickling) if this storage service doesn't shared.

I prefer to add the shared capability, which I believe is a comprehensive solution. However, I'm also open to the python binding just testing against specified services like s3, gcs, and fs if you feel that's sufficient.

@TennyZhuang
Copy link
Contributor Author

In fact, there are two types of capabilities in all backends.

  1. shared
    Access rights can be shared, all services backends implement this, while embedded engines don't.

  2. send
    Access rights can be transferred
    Almost all backends implement this, except for memory backend such as dashmap.

Of course, we don't need to be so strict and complicated, I support temporarily adding only shared capability.

@Xuanwo
Copy link
Member

Xuanwo commented Nov 16, 2024

I support temporarily adding only shared capability.

Would you like to add this capability? I believe we can include them in the next 0.51 release.

@TennyZhuang
Copy link
Contributor Author

I support temporarily adding only shared capability.

Would you like to add this capability? I believe we can include them in the next 0.51 release.

I can add it.

Please note that shared only indicates shared between processes but not shared between threads, it may be a little confusing if we will add backends which only support single-thread later.

However, I guess we will not support that in a while.

@Xuanwo
Copy link
Member

Xuanwo commented Nov 16, 2024

Please note that shared only indicates shared between processes but not shared between threads, it may be a little confusing if we will add backends which only support single-thread later.

However, I guess we will not support that in a while.

Currently, we don't offer any services that aren't Send because Access requires services to be both Send and Sync. There are no plans to add such services, but we can revisit this decision if necessary.

@TennyZhuang
Copy link
Contributor Author

Will test another PR use the CI flow, just ignore it.

Copy link
Member

@Zheaoli Zheaoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM about the Python binding part

I will merge this after all ci green

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @TennyZhuang for working on this and thanks @Zheaoli's review.

Please merge when you feel this PR is ok, @Zheaoli

@Zheaoli Zheaoli merged commit 73bbb85 into apache:main Nov 16, 2024
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants