Table GROUP BY on keyfield or ROWKEY does unnecessary repartition step #3366

Open
big-andy-coates opened this issue Sep 17, 2019 · 2 comments

big-andy-coates commented Sep 17, 2019

For example (taken from key-field.json and enhanced to check for a repartition topic):

    {
      "name": "table | initially set | group by (same) | key in value | no aliasing",
      "statements": [
        "CREATE TABLE INPUT (foo INT, bar INT) WITH (kafka_topic='input_topic', key='foo', value_format='JSON');",
        "CREATE TABLE OUTPUT AS SELECT foo, COUNT(*) FROM INPUT GROUP BY foo;"
      ],
      "inputs": [
        {"topic": "input_topic", "key": "1", "value": {"foo": 1, "bar": 2}}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": "1", "value": {"FOO": 1, "KSQL_COL_1":  1}}
      ],
      "post": {
        "sources": [
          {"name": "OUTPUT", "type": "table", "keyField": {"name": "FOO", "legacyName": "KSQL_INTERNAL_COL_0", "legacySchema": "STRING"}}
        ],
        "topics": {
          "blacklist": ".*-repartition"
        }
      }
    }

The above creates a table with a key field 'foo' and then does a group by on 'foo'. The topic/table is already keyed off 'foo', so no repartition should be required.
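To illustrate the reasoning, here is a toy sketch (not ksqlDB code). It assumes records are assigned to partitions by hashing their key, and uses `zlib.crc32` purely as a stand-in hash. The helper names and the partition count are hypothetical. It shows that re-keying rows by the column they are already keyed on changes no key, so no row can move partition, whereas re-keying by another column does change keys.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for a key-hash partitioner; not the hash function Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rekey_changes_key(rows, group_by: str) -> bool:
    """Return True if grouping by `group_by` gives any row a different key."""
    return any(str(value[group_by]) != key for key, value in rows)


# Rows keyed by 'foo', mirroring the shape of the INPUT table above.
rows = [(str(foo), {"foo": foo, "bar": foo * 2}) for foo in range(100)]

same_key_changes = rekey_changes_key(rows, "foo")   # grouping by the existing key
other_key_changes = rekey_changes_key(rows, "bar")  # grouping by a different column
```

When the key is unchanged, `partition_for` returns the same partition before and after the GROUP BY, which is why a repartition topic adds nothing in this case.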

Admittedly, it's a bit of a strange GROUP BY, given there will be exactly one row per key.

So... we should either fix this so it doesn't repartition or just throw an error, given it's also possible to achieve the same output with the following:

    {
      "name": "version without GROUP BY",
      "statements": [
        "CREATE TABLE INPUT (foo INT, bar INT) WITH (kafka_topic='input_topic', key='foo', value_format='JSON');",
        "CREATE TABLE OUTPUT AS SELECT foo, 1 FROM INPUT;"
      ],
      "inputs": [
        {"topic": "input_topic", "key": "1", "value": {"foo": 1, "bar": 2}}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": "1", "value": {"FOO": 1, "KSQL_COL_1":  1}}
      ],
      "post": {
        "topics": {
          "blacklist": ".*-repartition"
        }
      }
    }

@big-andy-coates modified the milestones: 6.0, 5.4 on Sep 17, 2019
@big-andy-coates
Contributor Author

Marked for release 6.0, as this would be a breaking change unless we added explicit handling (i.e. complexity) to make it backwards compatible.

We could do it earlier and make it backwards compatible by effectively ignoring GROUP BY ROWKEY or GROUP BY keyField on tables for newer queries.
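A minimal sketch of what "effectively ignoring" could look like as a planner check. This is an assumption about the shape of the logic, not ksqlDB's implementation: `needs_repartition` and its parameters are hypothetical names, and column names are compared case-insensitively on the assumption that unquoted identifiers are upper-cased (as with `foo` becoming `FOO` in the test output above).

```python
from typing import Optional, Sequence


def needs_repartition(
    source_type: str,
    key_field: Optional[str],
    group_by: Sequence[str],
) -> bool:
    """Hypothetical check: can a GROUP BY skip the repartition step?"""
    # Only the table case from this issue is handled; anything else
    # conservatively keeps the repartition step.
    if source_type != "table" or len(group_by) != 1:
        return True

    column = group_by[0].upper()
    already_keyed_on = {"ROWKEY"}
    if key_field is not None:
        already_keyed_on.add(key_field.upper())

    # Grouping by the existing key leaves every row's key unchanged.
    return column not in already_keyed_on


skip_for_key_field = not needs_repartition("table", "foo", ["foo"])
skip_for_rowkey = not needs_repartition("table", "foo", ["ROWKEY"])
```

Restricting the shortcut to newer queries, as suggested above, would need an additional version check that this sketch leaves out.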

@big-andy-coates changed the title from "Table GROUP BY on keyfield or ROWTIME does unnecessary repartition step" to "Table GROUP BY on keyfield or ROWKEY does unnecessary repartition step" on Sep 17, 2019
@big-andy-coates
Contributor Author

Might not be worth fixing if we deprecate KEY fields: #3537
