Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KSQL should not register a schema when it's only consuming from a topic #5553

Closed
CHR-LeeOlsen opened this issue Jun 4, 2020 · 7 comments
Closed
Assignees
Milestone

Comments

@CHR-LeeOlsen
Copy link

Describe the bug
When creating a stream over an existing topic with value_format=avro, ksql will register a dynamically generated schema. When using producers that don't generate the schema, this can cause it to attempt to use a schema that is incompatible with our data model.

To Reproduce
Steps to reproduce the behavior, include:

  1. version: confluentinc/ksqldb-server:0.9.0
  2. ksql:
CREATE STREAM appliedresearch_traffic_incident_updates_stream 
    (incident STRUCT<
        incidentId varchar, 
        cleared boolean, 
        confirmed boolean,
        details varchar,
        latitude double,
        longitude double,
        location varchar,
        roadwayDirection varchar, 
        roadwayName varchar,
        `timestamp` varchar,
        `type` varchar>)
WITH (
    kafka_topic = 'appliedresearch_traffic_incident_updates',
    value_format = 'avro'
);

Expected behavior
There should be no reason for ksql to register a schema here since it is only a consumer of the topic and would be using whatever schema was used to serialize the data.

Actual behaviour
It generates the following schema

{
  "connect.name": "io.confluent.ksql.avro_schemas.KsqlDataSourceSchema",
  "fields": [
    {
      "default": null,
      "name": "INCIDENT",
      "type": [
        "null",
        {
          "connect.name": "io.confluent.ksql.avro_schemas.KsqlDataSourceSchema_INCIDENT",
          "fields": [
            {
              "default": null,
              "name": "INCIDENTID",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "CLEARED",
              "type": [
                "null",
                "boolean"
              ]
            },
            {
              "default": null,
              "name": "CONFIRMED",
              "type": [
                "null",
                "boolean"
              ]
            },
            {
              "default": null,
              "name": "DETAILS",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "LATITUDE",
              "type": [
                "null",
                "double"
              ]
            },
            {
              "default": null,
              "name": "LONGITUDE",
              "type": [
                "null",
                "double"
              ]
            },
            {
              "default": null,
              "name": "LOCATION",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "ROADWAYDIRECTION",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "ROADWAYNAME",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "timestamp",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "default": null,
              "name": "type",
              "type": [
                "null",
                "string"
              ]
            }
          ],
          "name": "KsqlDataSourceSchema_INCIDENT",
          "type": "record"
        }
      ]
    }
  ],
  "name": "KsqlDataSourceSchema",
  "namespace": "io.confluent.ksql.avro_schemas",
  "type": "record"
}

Additional context
The auto generated schema is never used in any way because we produce to this topic ourselves. KSQL is only consuming from it. But this can break a producer under some use cases where the producer is not in control of the schema but rather using the latest version from the schema registry. There is really no value to KSQL generating this schema at all. It should only generate a schema for a topic it is writing to.

@CHR-LeeOlsen
Copy link
Author

This is related to a few other issues but different I think.

#3634
#2427

@derekjn derekjn added this to the 0.11.0 milestone Jun 12, 2020
@big-andy-coates
Copy link
Contributor

big-andy-coates commented Jun 19, 2020

I agree we shouldn't be registering a schema here.

A related issue, (linked above), is that we shouldn't be blindly overwriting otherwise compatible schemas either - but I think that's left to another discussion. Our whole handling of schemas needs a rethink.

However, the challenge here is that users could produce into this topic using INSERT INTO, which would require a schema to be registered. So it's not a simple a fix as we might first think.

@agavra has the most context / knowledge here...

Just another example of why INSERT INTO is evil and must DIE DIE DIE.

@agavra
Copy link
Contributor

agavra commented Jun 19, 2020

There's also the INSERT INTO ... VALUES, which can be pretty useful, and that utilizes the schema as well.

I think there are lots of places where reliance on schema registry might not make total sense for us - but the problem is that our serdes require schema registry. Perhaps we could mock a schema registry client that actually just looks things up in our metastore instead of looking into schema registry all the time (and only falling back to schema registry for topics that aren't in our metastore - like changelog topics, which fwiw we can also change to avoid using schema registry by having our own internal topic formats).

I'm not totally sure I know how I would solve this problem within the scope of 0.11 though... that's up for debate

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Jun 19, 2020

yeah, the INSERT VALUES functionality is very useful, but we need to fix our ownership model. We shouldn't allow inserts into topics we don't 'own'.... a la KLIP-19: Introduce Materialized Views

so that too shouldn't be an issue longer term...

Maybe we should just prioritise those KLIPs :D

@big-andy-coates
Copy link
Contributor

I think there are lots of places where reliance on schema registry might not make total sense for us - but the problem is that our serdes require schema registry. Perhaps we could mock a schema registry client that actually just looks things up in our metastore instead of looking into schema registry all the time (and only falling back to schema registry for topics that aren't in our metastore - like changelog topics, which fwiw we can also change to avoid using schema registry by having our own internal topic formats).

I don't disagree, but I'm not sure that's on-topic.

@agavra agavra added enhancement P1 Slightly lower priority to P0 ;) and removed bug needs-triage labels Jun 23, 2020
@agavra agavra removed this from the 0.11.0 milestone Jun 23, 2020
@agavra agavra added P0 Denotes must-have for a given milestone needs-triage and removed P1 Slightly lower priority to P0 ;) labels Jun 30, 2020
@apurvam apurvam added blocker and removed P0 Denotes must-have for a given milestone needs-triage labels Jul 1, 2020
@agavra agavra self-assigned this Jul 1, 2020
@agavra
Copy link
Contributor

agavra commented Jul 1, 2020

Looks like this is actually a regression from the previous release caused by #4717 - I will change this to make sure that CREATE statements only register the schema if the schema does not already exist (i.e. the create statement was the one that created the topic in first place)

@agavra
Copy link
Contributor

agavra commented Jul 1, 2020

This issue will be fixed in the next release. Thanks for pointing it out @CHR-LeeOlsen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants