fix: register correct unwrapped schema #6188

big-andy-coates · 2020-09-11T19:58:49Z

Description

This commit fixes several issues and refactors a lot of the serde code around wrapping and unwrapping single values. This needs to be done before we can support key formats that use wrapping / unwrapping.

The main issues being fixed are:

allow each format to define if it supports wrapping and/or unwrapping. (Not possible with current design)
pass the correct wrapping / unwrapping flags to key vs value formats when creating serde. (bug in code passes same SerdeOptions to key and value).
register the correct wrapped / unwrapped schema with the SR. (bug in existing code meant the registered schema is always wrapped).

At the same time, the way wrapping / unwrapping was handled in the code wasn't great. Formats like JSON needed to be able to handle both wrapped and unwrapped schemas and values, depending on whether the user explicitly set wrapping or unwrapping, vs the default behaviour of the format. This commit refactors the code such that the format will always be passed a consistent schema and the set of serde features the format should use when creating the serde. This simplifies things and paves the way to user-defined serde.
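To make the wrapped / unwrapped distinction concrete, here is a minimal sketch of how a single-column value might be rendered either way. It is an illustration only, not the code in this PR: `WrappingSketch` and `serializeSingle` are hypothetical names, `WRAP_SINGLES` is an assumed counterpart to the `UNWRAP_SINGLES` feature, and the output is hand-built JSON-like text for a numeric column.

```java
final class WrappingSketch {

  // Assumed feature names; only UNWRAP_SINGLES appears in the diff under review.
  enum SerdeFeature { WRAP_SINGLES, UNWRAP_SINGLES }

  // Renders a single numeric column either wrapped in an object or as a bare value.
  static String serializeSingle(
      final String column,
      final long value,
      final SerdeFeature feature
  ) {
    if (feature == SerdeFeature.UNWRAP_SINGLES) {
      // Unwrapped: the value is written anonymously, with no enclosing record.
      return Long.toString(value);
    }
    // Wrapped: the value is written as a field of an enclosing record.
    return "{\"" + column + "\":" + value + "}";
  }

  public static void main(final String[] args) {
    System.out.println(serializeSingle("ID", 10, SerdeFeature.WRAP_SINGLES));   // {"ID":10}
    System.out.println(serializeSingle("ID", 10, SerdeFeature.UNWRAP_SINGLES)); // 10
  }
}
```

The schema registered with the SR needs to match whichever of these two shapes is actually written, which is the third issue listed above.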

Reviewing notes:

There's some refactoring of how physical and persistence schemas are created and used:

SerdeOptions can now split the global options into key and value specific EnabledSerdeFeatures where,
EnabledSerdeFeatures is a set of SerdeFeatures that have been validated to ensure there are no clashing features (currently checking wrap and unwrap are not both set).
PhysicalSchema is now really just a combination of LogicalSchema and SerdeOptions.
PersistenceSchema is still using the Connect schema internally for now, but that schema will always be a STRUCT containing the key/value columns. The instance also tracks the key/value serde features.
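The validation and the key / value split described above could look roughly like the following. This is a sketch under stated assumptions rather than the classes added by the PR: `FeaturesSketch`, `EnabledFeatures`, `Options`, `keyFeatures` and `valueFeatures` are hypothetical stand-ins, and `WRAP_SINGLES` is an assumed feature name. Only the `enabled(...)` check mirrors the call visible in the diff.

```java
import java.util.EnumSet;
import java.util.Set;

final class FeaturesSketch {

  enum SerdeFeature { WRAP_SINGLES, UNWRAP_SINGLES }

  // A validated set of features: clashing features are rejected up front.
  static final class EnabledFeatures {
    private final Set<SerdeFeature> features;

    private EnabledFeatures(final Set<SerdeFeature> features) {
      this.features = features;
    }

    static EnabledFeatures of(final SerdeFeature... features) {
      final Set<SerdeFeature> set = EnumSet.noneOf(SerdeFeature.class);
      for (final SerdeFeature feature : features) {
        set.add(feature);
      }
      if (set.contains(SerdeFeature.WRAP_SINGLES)
          && set.contains(SerdeFeature.UNWRAP_SINGLES)) {
        throw new IllegalArgumentException(
            "Can't set both WRAP_SINGLES and UNWRAP_SINGLES");
      }
      return new EnabledFeatures(set);
    }

    boolean enabled(final SerdeFeature feature) {
      return features.contains(feature);
    }
  }

  // Hypothetical stand-in for the global options, split per key and value.
  static final class Options {
    private final EnabledFeatures key;
    private final EnabledFeatures value;

    Options(final EnabledFeatures key, final EnabledFeatures value) {
      this.key = key;
      this.value = value;
    }

    EnabledFeatures keyFeatures() {
      return key;
    }

    EnabledFeatures valueFeatures() {
      return value;
    }
  }

  public static void main(final String[] args) {
    final Options options = new Options(
        EnabledFeatures.of(),
        EnabledFeatures.of(SerdeFeature.UNWRAP_SINGLES));
    System.out.println(options.keyFeatures().enabled(SerdeFeature.UNWRAP_SINGLES));   // false
    System.out.println(options.valueFeatures().enabled(SerdeFeature.UNWRAP_SINGLES)); // true
  }
}
```

Keeping the key and value sets separate is what prevents the same flags being handed to both sides when the serde are created.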

How serde are created has been refactored:

The KsqlSerdeFactory is no more. Instead,
everything is hidden behind the Format interface, which now has a createSerde method, rather than one that returns a KsqlSerdeFactory. This is much cleaner and simpler.
The old implementations of KsqlSerdeFactory are now just implementation details of specific formats, and are free to be changed as needed.
Format deals in terms of SerdeFeatures, not SerdeOptions.
GenericRowSerde and GenericKeySerde are no longer responsible for handling wrapping and unwrapping; they just pass down and expect a Struct. (Common code has moved to GenericSerdeFactory).
This means Format.createSerde can return correctly typed Serde<Struct>, rather than some unknown maybe Struct maybe some primitive.
Each format is responsible for dealing with the features it supports.
KafkaFormat and DelimitedFormat can now support unwrapping and don't need to do anything to support it.
Only Connect formats support both wrapping and unwrapping, so the ConnectFormat handles this by extracting the single field's schema and wrapping the serde created by the sub-class with a serde that handles the unwrapping.
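The unwrapping step described for ConnectFormat can be pictured as a thin wrapper around the serde for the single field. The sketch below is not the PR's implementation: it assumes a `Map<String, Object>` as a stand-in for the Connect `Struct` and plain `Function`s as stand-ins for Kafka serializers and deserializers, and every name in it is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

final class UnwrapSketch {

  // Callers pass a whole single-field row; only the field's value reaches the inner serializer.
  static Function<Map<String, Object>, byte[]> unwrappingSerializer(
      final String field,
      final Function<Object, byte[]> inner
  ) {
    return row -> {
      if (row == null) {
        return null;
      }
      if (row.size() != 1 || !row.containsKey(field)) {
        throw new IllegalArgumentException("Expected single field: " + field);
      }
      return inner.apply(row.get(field));
    };
  }

  // The reverse: the inner deserializer yields a bare value, which is put back into a row.
  static Function<byte[], Map<String, Object>> unwrappingDeserializer(
      final String field,
      final Function<byte[], Object> inner
  ) {
    return bytes -> bytes == null
        ? null
        : Collections.<String, Object>singletonMap(field, inner.apply(bytes));
  }

  public static void main(final String[] args) {
    final Function<Map<String, Object>, byte[]> serializer = unwrappingSerializer(
        "ID", value -> value.toString().getBytes(StandardCharsets.UTF_8));
    final Function<byte[], Map<String, Object>> deserializer = unwrappingDeserializer(
        "ID", bytes -> Long.valueOf(new String(bytes, StandardCharsets.UTF_8)));

    final byte[] bytes = serializer.apply(Collections.<String, Object>singletonMap("ID", 10L));
    System.out.println(new String(bytes, StandardCharsets.UTF_8)); // 10
    System.out.println(deserializer.apply(bytes));                 // {ID=10}
  }
}
```

With this shape the outer code always deals in whole rows, which matches the point above that createSerde can return a correctly typed Serde<Struct>.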

Testing done

usual

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")


agavra

LGTM. I reviewed this change with a somewhat relaxed lens (it's hard to dig into these refactors at the code level), but what I checked looks good to me and what your description outlines is an overall positive change! Would have been nice to split it up into a few smaller PRs though 😅

ksqldb-common/src/main/java/io/confluent/ksql/schema/ksql/SchemaConverters.java

ksqldb-common/src/main/java/io/confluent/ksql/serde/EnabledSerdeFeatures.java

ksqldb-examples/src/main/java/io/confluent/ksql/datagen/DataGenProducer.java

ksqldb-serde/src/main/java/io/confluent/ksql/serde/avro/AvroSchemas.java

agavra · 2020-09-14T18:00:18Z

ksqldb-serde/src/main/java/io/confluent/ksql/serde/connect/ConnectFormat.java

-    final boolean unwrapSingle = serdeOptions.valueWrapping()
-        .map(option -> option == SerdeOption.UNWRAP_SINGLE_VALUES)
-        .orElse(false);
+    if (schema.features().enabled(SerdeFeature.UNWRAP_SINGLES)) {


it seems a little weird to encapsulate this above (applySinglesUnwrapping) but then check for it explicitly here

Why? applySinglesUnwrapping(outerSchema) only extracts the inner schema. That's only half the story. The code still needs to build a serde that can handle the unwrapping, i.e. extracting the value of the single column from the Struct passed to serialize and the reverse for deserialize.

I feel like if I'm already checking the UNWRAP_SINGLES variable here, I might as well just extract the inner schema there as well. Not a biggie, but now we're checking it in two places and doing a no-op in one of them if it's disabled.

ksqldb-functional-tests/src/test/resources/query-validation-tests/elements.json

big-andy-coates · 2020-09-14T22:47:26Z

Thanks for the review @agavra. Yeah, it was a big one. Would have been a lot of work to break it into separate PRs and would have taken a long time to chain the reviews together. It was definitely one of those times when you pull on a thread and the thing just keeps unraveling...


* refactor: push ConnectSchema down

Following on from the fixes and refactors done in #6188, this commit pushes down the use of the Connect schema type to lower levels of the code. Higher levels of the code now deal with `LogicalSchema`, `PersistentSchema` or just a `List<SimpleColumn>`. This moves us closer to removing the Connect schema from the code base, except in the Serde code that deals with connect formats.

`LogicalSchema` and `PersistentSchema` no longer know about the Connect schema type. Calls to retrieve the Connect schema from these types have been replaced with a util function that can convert a list of columns into a Struct Connect schema. As more code moves away from the Connect schema these util function calls will slowly be removed.

Co-authored-by: Andy Coates <[email protected]>

big-andy-coates requested a review from a team as a code owner September 11, 2020 19:58

test: historic plans

8cf7cf1

agavra approved these changes Sep 14, 2020


chore: almog's requested changes

7c764c7

big-andy-coates mentioned this pull request Sep 14, 2020

refactor: push Connect schema down #6200

Merged

2 tasks

big-andy-coates merged commit cb25f9c into confluentinc:master Sep 15, 2020

big-andy-coates deleted the wrapping branch September 15, 2020 00:18
