Table GROUP BY on keyfield or ROWKEY does unnecessary repartition step #3366

big-andy-coates · 2019-09-17T15:52:53Z

For example, take the following test case (taken from key-field.json and extended to check that no repartition topic is created):

    {
      "name": "table | initially set | group by (same) | key in value | no aliasing",
      "statements": [
        "CREATE TABLE INPUT (foo INT, bar INT) WITH (kafka_topic='input_topic', key='foo', value_format='JSON');",
        "CREATE TABLE OUTPUT AS SELECT foo, COUNT(*) FROM INPUT GROUP BY foo;"
      ],
      "inputs": [
        {"topic": "input_topic", "key": "1", "value": {"foo": 1, "bar": 2}}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": "1", "value": {"FOO": 1, "KSQL_COL_1":  1}}
      ],
      "post": {
        "sources": [
          {"name": "OUTPUT", "type": "table", "keyField": {"name": "FOO", "legacyName": "KSQL_INTERNAL_COL_0", "legacySchema": "STRING"}}
        ],
        "topics": {
          "blacklist": ".*-repartition"
        }
      }
    }

The above creates a table with a key field 'foo' and then does a group by on 'foo'. The topic/table is already keyed off 'foo', so no repartition should be required.
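The reasoning above can be sketched as a small check. This is only an illustration of the expected behaviour, not ksqlDB's planner code: the function name `needs_repartition` and its parameters are hypothetical, and it assumes identifiers are compared case-insensitively.

```python
from typing import Optional, Sequence


def needs_repartition(key_field: Optional[str], group_by: Sequence[str]) -> bool:
    """Hypothetical helper: decide whether a table GROUP BY must re-key the data.

    key_field: name of the table's declared key field, or None if unset.
    group_by:  the column names listed in the GROUP BY clause.
    """
    # Grouping by several columns produces a new composite key,
    # so the data has to be repartitioned.
    if len(group_by) != 1:
        return True

    column = group_by[0].upper()

    # Grouping by ROWKEY keeps the existing key.
    if column == "ROWKEY":
        return False

    # Grouping by the declared key field also keeps the existing key.
    return key_field is None or column != key_field.upper()
```

Under these assumptions, `needs_repartition("foo", ["foo"])` is false for the test case above, while grouping by `bar` would still require a repartition.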

Admittedly, it's a bit of a strange GROUP BY, given there will be exactly one row per key.

So... we should either fix this so it doesn't repartition or just throw an error, given it's also possible to achieve the same output with the following:

    {
      "name": "version without GROUP BY",
      "statements": [
        "CREATE TABLE INPUT (foo INT, bar INT) WITH (kafka_topic='input_topic', key='foo', value_format='JSON');",
        "CREATE TABLE OUTPUT AS SELECT foo, 1 FROM INPUT;"
      ],
      "inputs": [
        {"topic": "input_topic", "key": "1", "value": {"foo": 1, "bar": 2}}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": "1", "value": {"FOO": 1, "KSQL_COL_1":  1}}
      ],
      "post": {
        "topics": {
          "blacklist": ".*-repartition"
        }
      }
    }
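Both test cases rely on the `blacklist` post-condition to assert that no repartition topic exists. A minimal sketch of such a check follows, assuming the pattern must match the whole topic name; the helper `blacklisted_topics` and the internal topic name in the usage note are made up for illustration and are not taken from the test framework.

```python
import re
from typing import Iterable, List


def blacklisted_topics(pattern: str, topics: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the topics whose full name matches the pattern."""
    regex = re.compile(pattern)
    # fullmatch is an assumption here: the whole topic name must match.
    return [topic for topic in topics if regex.fullmatch(topic)]
```

With the pattern `".*-repartition"`, a topic list of `["input_topic", "OUTPUT"]` yields no matches, whereas a list that also contains a name such as `"some-internal-name-repartition"` would fail the post-condition.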

big-andy-coates · 2019-09-17T16:01:52Z

Marked for release 6.0, as this would be a breaking change unless we added explicit handling (i.e. complexity) to make it backwards compatible.

We could do it earlier and make it backwards compatible by effectively ignoring GROUP BY ROWKEY or GROUP BY keyField on tables for newer queries.

big-andy-coates · 2019-10-11T12:20:46Z

Might not be worth fixing if we deprecate KEY fields: #3537

big-andy-coates modified the milestones: 6.0, 5.4 Sep 17, 2019

big-andy-coates changed the title ~~Table GROUP BY on keyfield or ROWTIME does unnecessary repartition step~~ Table GROUP BY on keyfield or ROWKEY does unnecessary repartition step Sep 17, 2019

big-andy-coates mentioned this issue Sep 17, 2019

Stream GROUP BY on ROWKEY does unnecessary repartition step #3367

Closed

big-andy-coates added the engine label Sep 26, 2019

big-andy-coates added the breaking-change label Nov 22, 2019

big-andy-coates modified the milestones: 6.0, 1.0.0 Dec 2, 2019
