Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NPE when joining table to result of a table table join #1686

Closed
ppadovani opened this issue Aug 3, 2018 · 4 comments
Closed

NPE when joining table to result of a table table join #1686

ppadovani opened this issue Aug 3, 2018 · 4 comments
Assignees
Labels

Comments

@ppadovani
Copy link

ppadovani commented Aug 3, 2018

[2018-08-02 18:02:23,454] ERROR java.lang.NullPointerException io.confluent.ksql.planner.plan.JoinNode$TableToTableJoiner.join(JoinNode.java:479) io.confluent.ksql.planner.plan.JoinNode$TableToTableJoiner.join(JoinNode.java:448) io.confluent.ksql.planner.plan.JoinNode.buildStream(JoinNode.java:187) io.confluent.ksql.planner.plan.ProjectNode.buildStream(ProjectNode.java:124) io.confluent.ksql.planner.plan.KsqlBareOutputNode.buildStream(KsqlBareOutputNode.java:64) io.confluent.ksql.physical.PhysicalPlanBuilder.buildPhysicalPlan(PhysicalPlanBuilder.java:106) io.confluent.ksql.QueryEngine.buildQueryPhysicalPlan(QueryEngine.java:199)

Steps to reproduce:

bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person

bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic email

bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic personaldata

cat person.json | kafkacat -b localhost:9092 -t person -P -K:

cat email.json | kafkacat -b localhost:9092 -t email -P -K:

cat personal_data.json | kafkacat -b localhost:9092 -t personaldata -P -K:

person.json:
1234:{ "__civ_header_customerId": "85e2823d-806c-4376-83ed-1890890ecb15", "__civ_header_changeId":1533135989116, "EMPLID":"1234", "BIRTHDATE":1533136329, "BIRTHPLACE":"Druex", "BIRTHCOUNTRY":"Grance", "BIRTHSTATE":"France" }
email.json:
1234:{ "__civ_header_customerId":"85e2823d-806c-4376-83ed-1890890ecb15", "__civ_header_changeId":1533145979117, "EMPLID":"1234", "E_ADDR_TYPE":"work", "EMAIL_ADDR":"[email protected]", "PERF_EMAIL_FLAG":"X" }
personaldata.json:
1234:{ "__civ_header_customerId":"85e2823d-806c-4376-83ed-1890890ecb15", "changeId":1533135979113, "EMPLID":"1234", "FIRST_NAME":"Frank", "LAST_NAME":"Paris", "MIDDLE_NAME":"London", "COUNTRY":"France", "ADDRESS1":"1 Rue de Mort", "CITY":"Paris", "STATE":"Paris", "POSTAL":"1234", "SEX":"M", "MAR_STATUS":"SINGLE" }

  • Create Tables for each of the above topics using EMPLID as the key.
  • create table personemail with (partitions=1) as select * from email join person on email.emplid = person.emplid;
  • create table personemaildata as select * from personemail join personaldata on personemail.PERSON_EMPLID = personaldata.emplid;

The last command will cause the NPE to get thrown.

@apurvam
Copy link
Contributor

apurvam commented Aug 3, 2018

Thanks for the detailed report @ppadovani . It looks like a genuine bug (the keyField does not seem to be defined on the personemail table). I am curious: what happens if the left and right table on the join node are reversed (ie. if you do personaldata join personemail). My guess is that should work.

@apurvam apurvam assigned ppadovani and apurvam and unassigned ppadovani Aug 3, 2018
@apurvam apurvam added the bug label Aug 3, 2018
@apurvam
Copy link
Contributor

apurvam commented Aug 3, 2018

This seems to be a more general problem. If you do a create table foo_bar as select field_1, field_2 from some_other_table, the key field on the table foo_bar is not set. This is tricky, because the select expression may not even contain the key field and so it won't be in the value. In this case, if the table is used down stream in a way that a key is expected (like in a join), the query will fail (in this case with an NPE).

I think the solution here may take a bit of thought -- it is a fundamental issue that relates to the discussion in #804.

For now, a workaround that you can try is create a temporary table on top of personemail and specify the key explicitly for that one, and then use it in the downstream join. For instance, the tmp table below is tmp_personemail:

CREATE TABLE tmp_personemail (the_joined_fields_here) WITH (kafka_topic='PERSONEMAIL', key='emplid');
create table personemaildata as select * from tmp_personemail join personaldata on tmp_personemail.emplid = personaldata.emplid;

The tmp_personemail table is purely metadata and is used to explicitly set the key as metadata within KSQL so that the downstream join can work.

This should work, let me know if you see differently.

@ppadovani
Copy link
Author

Tried to reverse the join:
ksql> create table personemaildata as select * from personaldata join personemail on personaldata.EMPLID = personemail.EMPLID; Line: 1, Col: 101 : Invalid join criteria (PERSONALDATA.EMPLID = PERSONEMAIL.EMPLID). Key for PERSONEMAIL is not set correctly.

I can confirm that making an additional table to join to solves the issue as a workaround.

@apurvam
Copy link
Contributor

apurvam commented Aug 3, 2018

@rodesai will fix the bug where we don't set the key properly even if it is selected. The larger issue of having to select the key explicitly will be part of #804

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants