
feat(ksql-connect): poll connect-configs and auto register sources #3178

Merged: 7 commits into confluentinc:master from connect_poller, Aug 12, 2019

Conversation

@agavra (Contributor) commented Aug 6, 2019

Description

This PR implements the most bare-bones mechanism to automatically import tables created from a JDBC connector. It should be noted that this PR is incremental and is the minimum chunk that I felt I could implement and put out in a single PR; see the Future Work section for upcoming PRs. Because of this, there are lots of limitations, BUT if you look at the Testing Done section, it makes for a pretty 💣 💥 demo! The remaining work makes it much more robust.

Design

There are two main services introduced:

  • ConnectConfigService is in charge of listening to the connect-configs topic (the topic name is configurable) and polling the /connectors endpoint, extracting "known" connector configurations and passing them to the ConnectPollingService. As of now, the JDBC source connector is the only known connector.
  • ConnectPollingService is a scheduled service that runs at a fixed interval, scanning all Kafka topics to see whether any of them could have been created by a connector that was passed in by ConnectConfigService. If any match, it issues a CREATE TABLE request to the KSQL endpoint.
                  +-----------------------------------------+
                  |        KSQL                             |       +---------+
                  |                           +------------------>  | Kafka   |
                  |                           |             |       +---------+
                  |               +-----------+-----------+ |
                  |        +------+ ConnectPollingService | |
                  |        |      +--------+--+-----------+ |
                  |        |               ^  |             |       +---------+
                  |        |               |  +------------------>  | SR      |
+---------+       |        v               |                |       +---------+
|         |       | +------+--+   +--------+------------+   |
|  >cli   +-------->+  /ksql  |   | ConnectConfigService|   |
|         |       | +---+-----+   +----+---+------------+   |
+---------+       |     |              |   ^                |
                  |     |              |   |                |
                  |     |              |   |                |
                  +-----------------------------------------+
                        |              |   |
                 +------v-------+<-----+   |
                 |              |          |
                 |   connect    | +--------+--------+
                 |              | | connect configs |
                 +--------------+ +-----------------+

The diagram above describes the flow of creating a connector and having it automatically imported into KSQL (including what was implemented in #3149).
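To make the ConnectPollingService half of that flow concrete, here is a minimal sketch of one polling iteration, assuming the JDBC-style case where a known connector is identified by its configured topic.prefix. The class and method names are hypothetical, not the actual KSQL implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class PollingIterationSketch {

  // One scheduled iteration: given the topic prefixes of connectors that
  // ConnectConfigService has registered, find the Kafka topics that a known
  // connector could have created. Each match would trigger a CREATE TABLE
  // request to the KSQL endpoint.
  static List<String> matchTopics(final Set<String> topicPrefixes, final List<String> kafkaTopics) {
    return kafkaTopics.stream()
        .filter(topic -> topicPrefixes.stream().anyMatch(topic::startsWith))
        .collect(Collectors.toList());
  }
}
```

For example, a connector configured with topic.prefix jdbc- would claim the topic jdbc-users but not an unrelated orders topic.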

Beyond that, the following classes were changed:

  • configs were added for (1) disabling this feature and (2) specifying the Kafka topic for connect-configs
  • the KsqlConnect class simply wraps the two classes described above into one easy-to-pass-around component
  • the Connector class models information specific to each connector (e.g. the topic.prefix config for the JDBC connector), and Connectors helps create those
  • CreateConfigs (WITH clause) now accepts metadata describing which connector created the source. This is not used as of this PR, but it was straightforward enough that removing it seemed more trouble than it was worth.

Distributed System Concerns

Since multiple servers will be running this at the same time, we make sure that only one is in charge by having them all share a group.id when reading from connect-configs. If a server becomes the one assigned to read from connect-configs, it will reconstruct the entire state by calling /connectors and reading data from connect.
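The shared-group.id trick can be sketched with standard Kafka consumer configs; the group id value below is a hypothetical illustration, not the actual config KSQL uses.

```java
import java.util.Properties;

final class ConnectConfigsConsumerProps {

  static Properties sharedGroupProps(final String bootstrapServers) {
    final Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    // Every KSQL server uses the SAME group.id, so Kafka's group coordinator
    // assigns the connect-configs partition(s) to only one consumer at a time.
    props.put("group.id", "ksql-connect-configs-reader");
    // Read from the beginning on assignment, so the newly assigned server can
    // reconstruct the entire state.
    props.put("auto.offset.reset", "earliest");
    return props;
  }
}
```

If the assigned server dies, the group coordinator reassigns the partition to another server in the same group, which then replays the topic to rebuild state.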

Security Concerns

cc @spena - since this is asynchronous, the KSQL principal will be the one who creates the table from the connect topic. Do you have any suggestions here with regards to the ksql security model?

Future Work

  • We need to add support for dropping these sources. As of this design, even if you DROP <SOURCE>, it will be re-created the next time ConnectPollingService runs.
  • Work needs to be done in connect to expose a metadata topic to replace the connect-configs topic. When we do that, ConnectPollingService can be removed.
  • We need to add DESCRIBE functionality to connectors, which will leverage the WITH clause change in this PR. This will be in the returned response for CREATE SOURCE CONNECTOR for improved usability.
  • We will add AWAIT <SOURCE> so that users can wait until a certain stream is imported into KSQL.
  • Full integration system tests for this, and documentation of the end-to-end connect integration.
  • Simplify JDBC configuration for SMTs.

Testing done

  • Unit tests mocking out most major components
  • End to end manual testing:
ksql> CREATE SOURCE CONNECTOR `jdbc-connector` WITH(
    "connector.class"='io.confluent.connect.jdbc.JdbcSourceConnector',
    "tasks.max"='1',
    "connection.url"='jdbc:postgresql://localhost:5432/almog.gavra',
    "mode"='bulk',
    "topic.prefix"='jdbc-',
    "transforms"='createKey,extractString',
    "transforms.createKey.type"='org.apache.kafka.connect.transforms.ValueToKey',
    "transforms.createKey.fields"='username',
    "transforms.extractString.type"='org.apache.kafka.connect.transforms.ExtractField$Key',
    "transforms.extractString.field"='username',
    "key.converter"='org.apache.kafka.connect.storage.StringConverter');

 Message
----------------------------------
 Created connector jdbc-connector
----------------------------------

ksql> SHOW TABLES;

 Table Name           | Kafka Topic | Format | Windowed
--------------------------------------------------------
 JDBC_CONNECTOR_USERS | jdbc-users  | AVRO   | false
--------------------------------------------------------

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra requested a review from a team as a code owner August 6, 2019 19:58
@hjafarpour (Contributor)

@agavra I have a couple of questions about the design before I start reviewing the PR.

  • It seems that we will consider all the existing connectors, regardless of whether they were created by KSQL, am I correct?

  • Same as above, it seems that we will import all of the connect-generated topics, regardless of whether the connector was created by KSQL, right?

  • How do we decide whether we should create a stream or a table for an imported topic? For instance, we can import a stream or a table using a JDBC connector.

@agavra (Contributor, Author) commented Aug 7, 2019

Thanks @hjafarpour - answers in line.

  • It seems that we will consider all the existing connectors, regardless of whether they were created by KSQL, am I correct?
  • Same as above, it seems that we will import all of the connect-generated topics, regardless of whether the connector was created by KSQL, right?

Correct for both of these. I don't think we should distinguish between what was created by KSQL and what was not.

  • How do we decide whether we should create a stream or a table for an imported topic? For instance, we can import a stream or a table using a JDBC connector.

I talked about this with @rmoff just earlier today. As it stands, we're going to make this decision per "blessed" connector (e.g. JDBC will always create a table, and Kinesis will always create a stream). I think this is an acceptable first step, as long as we allow users to re-import things as the other type if we made a mistake. In the future, I think we should bleed the stream/table duality into connect and allow the connectors to specify what it is that they are creating.
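The per-"blessed"-connector decision could look something like the sketch below. The JDBC class name is the real connector class from the testing section; the Kinesis entry is a hypothetical placeholder, since no Kinesis class name appears in this PR.

```java
import java.util.Map;

final class BlessedConnectorTypes {

  enum SourceType { TABLE, STREAM }

  // Each "blessed" connector class maps to a fixed source type.
  private static final Map<String, SourceType> BLESSED = Map.of(
      "io.confluent.connect.jdbc.JdbcSourceConnector", SourceType.TABLE,
      "hypothetical.KinesisSourceConnector", SourceType.STREAM);

  // Returns null for connectors that are not (yet) blessed, which would
  // simply be ignored by the polling service.
  static SourceType decide(final String connectorClass) {
    return BLESSED.get(connectorClass);
  }
}
```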

@hjafarpour (Contributor)

Hmm, I think we indeed should distinguish between KSQL-generated connectors and other ones. Otherwise, we would have significant complexity in managing KSQL-generated connectors. For instance, should we be able to drop a connector that was created outside KSQL?
Importing topics from other connectors may also cause security-related issues. For instance, the KSQL server may not have access to such topics.

@agavra (Contributor, Author) commented Aug 7, 2019

Hmm, I think we indeed should distinguish between KSQL-generated connectors and other ones. Otherwise, we would have significant complexity in managing KSQL-generated connectors. For instance, should we be able to drop a connector that was created outside KSQL?
Importing topics from other connectors may also cause security-related issues. For instance, the KSQL server may not have access to such topics.

Talked offline so I'm summarizing discussion here. There are two different concerns baked into that:

  • Security
  • UI Bloat (importing too much data)

I think the Security aspect should be handled transparently by some RBAC-like system and shouldn't impact the design choices here. Viewing, creating and dropping connectors should all go through the principal of the client issuing the command.

UI bloat is another issue: most other systems (e.g. Hive, Presto, etc.) require each script to declare what it wants to use, and do not make everything immediately usable. The approach suggested in this PR could cause too many streams/tables to show up when you SHOW TABLES. With regard to this, it may be valuable to store metadata in the metastore recording which connectors are created by KSQL and which are not, and then filter on that set in the ConnectPollingService.

Since this feature is additive, we can implement it in a follow-up PR.

@hjafarpour (Contributor) left a comment:

Two more questions before LGTM! :)

* {@link ConnectPollingService} to digest and register with KSQL.
*
* <p>On startup, this service reads the connect configuration topic from the
* beginning to make sure that it reconstructs the necessary state.</p>
Contributor:

What happens when connectors are terminated? Does Connect write such information to this topic? How do we deal with such scenarios where external systems create and terminate connectors?

Contributor Author:

Good questions! Answers inline:

What happens when connectors are terminated? Does Connect write such information to this topic?

Nothing happens, because connect doesn't write any such information to the topic. More importantly, it doesn't delete the topics that it already created, so I'm not sure what the expected behavior would be.

In the future when we support SHOW CONNECTORS, they would not show up.

How do we deal with such scenarios where external systems create and terminate connectors?

The beautiful thing about this design is that everything is decoupled, so whether the connector is created internally or externally, nothing changes!

topic, connector.getName(), source);
final Builder<String, Literal> builder = ImmutableMap.<String, Literal>builder()
.put(CommonCreateConfigs.KAFKA_TOPIC_NAME_PROPERTY, new StringLiteral(topic))
.put(CommonCreateConfigs.VALUE_FORMAT_PROPERTY, new StringLiteral("AVRO"))
Contributor:

Do we always import data in AVRO format? What if we want other formats?

Contributor Author:

For now, yes, since Schema Registry only supports AVRO. When it supports other types, hopefully there will be some API that allows us to get what format it is and inject it here.

@agavra agavra requested a review from a team August 7, 2019 22:18
@hjafarpour (Contributor) left a comment:

LGTM!
As we discussed after the MVP we can revisit the design decisions.

@agavra agavra requested a review from a team August 7, 2019 23:06
@rodesai (Contributor) left a comment:

Thanks, @agavra! Left a bunch of comments inline.

final String name = connector.getName();
final String source = connector.mapToSource(topic).toUpperCase();

// if the meta store already contains the source, don't send the extra command
Contributor:

Should we be looking for sources that use the topic rather than the exact source name the connector would use? E.g., if a user creates a stream and topic and then starts a connector, should we automatically create another stream? (I honestly don't know the answer here.)

Contributor Author:

That might be a way to go about it, but it requires adding an index in the Metastore. For now, I just want to make sure I'm not spamming the command topic with unnecessary commands, and this serves that purpose. We can always change it later; it would be mostly backwards compatible.
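The duplicate-avoidance guard being discussed can be sketched as a simple existence check; the class and method names here are illustrative rather than the actual KSQL code.

```java
import java.util.Set;

final class DuplicateSourceGuard {

  // Only issue a CREATE TABLE command if the meta store does not already
  // contain the source name. Source names are compared upper-cased, matching
  // the mapToSource(topic).toUpperCase() call quoted above.
  static boolean shouldIssueCreate(final Set<String> existingSources, final String sourceName) {
    return !existingSources.contains(sourceName.toUpperCase());
  }
}
```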


@Override
protected Scheduler scheduler() {
return Scheduler.newFixedRateSchedule(0, INTERVAL_S, TimeUnit.SECONDS);
Contributor:

One potential improvement could be to poll with a backoff in such a way that if we add a new connector we can poll with a short delay and back off until we hit the steady state interval. Out of scope for this change though.
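The backoff idea could be as small as the following sketch: poll with a short delay right after a new connector is registered, then back off toward the steady-state interval. The doubling factor and the reset-on-new-connector policy are illustrative choices, not anything specified in this PR.

```java
final class PollBackoff {

  // Next delay doubles each round, capped at the steady-state interval.
  static long nextDelaySeconds(final long currentDelaySeconds, final long steadyStateSeconds) {
    return Math.min(currentDelaySeconds * 2, steadyStateSeconds);
  }

  // When a new connector is registered, a caller would reset the delay to a
  // short value (e.g. 1 second) and let it grow back via nextDelaySeconds.
  static long resetDelaySeconds() {
    return 1L;
  }
}
```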

Contributor Author:

Good suggestion. I was thinking of adding a whole suite of things to improve this (I might change it from a scheduled service to something more custom that wakes up either every N seconds or when we add a connector, something like a blocking queue).

}
}

private static Connector jdbc(final Map<String, String> properties) {
Contributor:

I think this fits better in its own class; otherwise this will get out of hand as we add more connectors.

Contributor Author:

Let's handle this problem as it comes. I was considering the best way to do it, and I'm not sure that having a class per Connector will make it any cleaner; in fact, I think it might make the amount of boilerplate annoying to refactor. I may be wrong, but refactoring in the future is easy :)

private static <T> ResponseHandler<ConnectResponse<T>> createHandler(
@SuppressWarnings("unchecked")
@Override
public ConnectResponse<List<String>> connectors() {
Contributor:

We should bake in some retries here for network/5xx errors. Ditto below.
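A retry wrapper of the kind being suggested could look like this sketch; it is illustrative, not the actual KSQL connect client code, and assumes transient failures surface as RuntimeExceptions.

```java
import java.util.function.Supplier;

final class SimpleRetry {

  // Retry the call up to maxAttempts times, rethrowing the last failure if
  // every attempt fails. Assumes maxAttempts >= 1.
  static <T> T withRetries(final Supplier<T> call, final int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return call.get();
      } catch (RuntimeException e) { // e.g. a network error or a mapped 5xx
        last = e;
      }
    }
    throw last; // all attempts exhausted
  }
}
```

A production version would also add a delay (or backoff) between attempts and only retry errors known to be transient.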

Contributor Author:

I will do this in a follow-up PR

@rodesai (Contributor) left a comment:

LGTM!

@agavra agavra merged commit 6dd21fd into confluentinc:master Aug 12, 2019
@agavra agavra deleted the connect_poller branch August 12, 2019 22:35