Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support multi-column key declarations #6544

Merged
merged 3 commits into from
Nov 6, 2020

Conversation

agavra
Copy link
Contributor

@agavra agavra commented Oct 29, 2020

partial implementation for #824 - all that's left is lifting the config flag

Description

Well @big-andy-coates does it again. I dug into this and hit some walls the first times around making me think it would be really tough, but after a few scrap-my-work-and-try-again loops I have proved Andy right and myself wrong - it was actually much easier to support multi-column keys than I suspected!

There are still imitations (you cannot issue pull queries against a multi-column key table and you cannot join a multi-column source that cannot be repartition) and the existing limitations still hold (group by still causes an ugly concatenation of the key) but will be fixed in future releases.

Testing done

Mostly fixed unit tests, added multi-key-cols.json which tests most of the basic scenarios.

TODOs

After this PR we will still need to:

  • split up the mega ksql.key.format.enabled config into multiple that target smaller diffs
  • schema inference work
  • make sure avro/pb work (all tests were done in JSON)
  • fix semantics
  • joins, pull query support
  • partition by multiple column support

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra force-pushed the multi-column-keys branch 3 times, most recently from 7b75de3 to 20d2c78 Compare October 29, 2020 21:44
@agavra agavra changed the title feat: support multi-column key declarations [DRAFT - testing still in progress] feat: support multi-column key declarations Oct 29, 2020
@agavra agavra marked this pull request as ready for review October 29, 2020 22:07
@agavra agavra requested a review from a team as a code owner October 29, 2020 22:07
@agavra agavra requested a review from vcrfxia October 29, 2020 22:07
@agavra agavra force-pushed the multi-column-keys branch 2 times, most recently from 348b36b to 20d2c78 Compare October 30, 2020 17:40
Copy link
Contributor

@vcrfxia vcrfxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff @agavra ! LGTM with the usual questions + nits inline.

if (seenKey.getValue().size() > 1) {
final String keys = GrammaticalJoiner.and().join(
seenKey.getValue().stream().map(Name::text).sorted());
throw new KsqlException("The projection contains a key column (" + seenKey.getKey()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to your PR, but can you help me understand why the validation that key columns are not selected more than once is performed in FinalProjectNode while the validation that key columns are selected is performed in DataSourceNode, rather than performing both checks in the same place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every node "validates" that the key is selected, they just all delegate to the child node that has that information:

// PlanNode.java:L145
  void validateKeyPresent(final SourceName sinkName, final Projection projection) {
    getSources().forEach(s -> s.validateKeyPresent(sinkName, projection));
  }

DataSourceNode, JoinNode and UserRepartitionNode all override this method because those are the nodes that truly "know" whether the key was selected. Imagine we projected the key column with a different name, the schema in the FinalProjectNode would consider that the key - but of course we didn't select that! We selected the original key.

To illustrate, in the example below the project node has a schema [col1 INT PRIMARY KEY] but we didn't select col1 - so it would fail! Of course we could do some magic and "remember" that b.col1 is actually a.id, but why not just delegate that to the node that already tracks that?

CREATE TABLE a (id INT PRIMARY KEY, col1 INT);
CREATE TABLE b AS SELECT id AS col1; 

(Note that it's a little confusing, because there is also VerifiableNode, which AggregateNode, FinalProjectNode and SuppressNode all override - and it is VerifiableNode#validateKeyPresent is what is called at the top level, but these just delegate down to the three source nodes eventually).

why the validation that key columns are not selected more than once is performed in FinalProjectNode

I think this one is clearer why it can't be in DataSourceNode, it's the other way around that might need some justification.


All that being said, I think this can definitely be made clearer going forward. The logical nodes don't follow a very strict abstraction model unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation -- super helpful!

}

public Struct build(final Object keyValue) {
public Struct build(final Object keyValue, final int fieldIndex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check my understanding: this works today because we only ever end up with a single key after partition by or group by, but this method signature will have to change to support building structs with multiple key columns in a future PR, yes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - I tried to change this to make it more generic but it causes a shocking amount of code jitter, so I'll do that in a future PR

}
},
{
"name": "nested struct key",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this is a nested struct key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

physically, it's a struct within a struct. I'll rename this "struct as column in multi-column key"

}
}
]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a negative test for partition by on multiple columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally did not add this because our parser doesn't even except PARTITION BY x,y but i'll add one in and make sure it doesn't parse

@agavra agavra merged commit 2e1023e into confluentinc:master Nov 6, 2020
@agavra agavra deleted the multi-column-keys branch November 6, 2020 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants