Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Producer & Fetcher should provide serializes with topic name #836

Closed
simingy opened this issue Sep 29, 2016 · 6 comments
Closed

Producer & Fetcher should provide serializes with topic name #836

simingy opened this issue Sep 29, 2016 · 6 comments

Comments

@simingy
Copy link

simingy commented Sep 29, 2016

Serializers should be topic aware.

From java reference documentation:

byte[] serialize(java.lang.String topic,
                        T data)
T deserialize(java.lang.String topic,
                    byte[] data)

The implementation in Python is missing this information.
KafkaPoducer._serialize is called with topic name, but the topic is not propagated to the serializer.
KafkaConsumer calls Fetcher, whos _deserialize api is missing topic information

@dpkp
Copy link
Owner

dpkp commented Sep 29, 2016

I'm torn here. I think the simplicity of value_serializer=json.dumps is extremely powerful. On the other hand, w/o a more robust interface it is hard/impossible to use a generic serializer w/ schema repository.

If we were to add this interface, I think we'd also want to add from java client:

configure(Map<String, ?> configs, boolean isKey);

Should we break the existing interface and require a more complicated starter config like value_serializer=lambda _, d: json.dumps(d) or perhaps we add configuration keys for the new interface like value_serializer_class / key_serializer_class ?

@simingy
Copy link
Author

simingy commented Sep 29, 2016

value_serializer=json.dumps is OK for many things, but serializing json is not very efficient: the keys themselves take up space.
If using something like an Apache Avro serializer, each topic would have its own schema. The advantage would be that you have both the structure predefined in a format (schema), and the fact that Avro packs only the values, not the keys. This is much more efficient at a network bandwidth level.

All things aside.. I think the proper implementation would be to extend the requirement for value_serializer implementation: always mandate two arguments: topic, and message:

Eg:

def serialize(topic, message):
    pass

it would still be 1 serializer... and whether the implementer uses that topic information is up to them.

@dpkp
Copy link
Owner

dpkp commented Sep 29, 2016

Sorry -- yes, json is extremely slow. I typically would use msgpack.dumps instead, which is much faster. Avro in python is still quite slow, however (have you benchmarked lately?). I've been meaning to hack on fastavro, but anyways I dont think using avro currently is a great story in python.

Nonetheless, changing the existing configuration setting would break operations for most current users and I won't do that w/o a compelling reason. I think adding optional _class configurations is probably the least bad approach.

@simingy
Copy link
Author

simingy commented Sep 29, 2016

could do with that.

we're not extremely performance oriented - not yet at least. it's a great standard for wide-rage deployment though, so that's why we are looking into using it.

@simingy
Copy link
Author

simingy commented Sep 29, 2016

how about this - if the idea of python-kafka is to be as close in implementation as possible to the Java reference, introduce two new class interfaces: Serializer and Deserializer
and when value_serializer and value_deserializer is provided, do a type check: if subclass of the above classes, call them topic/message, otherwise, fall back to current implementation - simple callable.

there is "certain" amount of incentive if backwards compatibility was broken: producer/consumers are multi-topic capable, which means often - a different serializer could be used.. etc. It's a major fall-through from the original Java reference...

@dpkp
Copy link
Owner

dpkp commented Sep 30, 2016

I like that idea. Want to submit a PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants