
feat(ksql-connect): poll connect-configs and auto register sources #3178

Merged: 7 commits into confluentinc:master from connect_poller, Aug 12, 2019

Conversation

@agavra (Contributor) commented Aug 6, 2019

Description

This PR implements the most bare-bones mechanism to automatically import tables created from a JDBC connector. It should be noted that this PR is incremental and is the minimum chunk that I felt I could implement and put out in a single PR; see the Future Work section for upcoming PRs. Because of this, there are lots of limitations, BUT if you look at the Testing Done section, it makes for a pretty 💣 💥 demo! The remaining work makes it much more robust.

Design

There are two main services introduced:

  • ConnectConfigService is in charge of listening to the connect-configs topic (the topic name is configurable) and polling the /connectors endpoint, extracting "known" connector configurations and passing them to the ConnectPollingService. As of now, the JDBC source connector is the only known connector.
  • ConnectPollingService is a scheduled service that runs at a fixed interval, scanning all Kafka topics to see whether any of them could have been created by a connector that was passed in by ConnectConfigService. If any match, it issues a CREATE TABLE request to the KSQL endpoint.
                  +-----------------------------------------+
                  |        KSQL                             |       +---------+
                  |                           +------------------>  | Kafka   |
                  |                           |             |       +---------+
                  |               +-----------+-----------+ |
                  |        +------+ ConnectPollingService | |
                  |        |      +--------+--+-----------+ |
                  |        |               ^  |             |       +---------+
                  |        |               |  +------------------>  | SR      |
+---------+       |        v               |                |       +---------+
|         |       | +------+--+   +--------+------------+   |
|  >cli   +-------->+  /ksql  |   | ConnectConfigService|   |
|         |       | +---+-----+   +----+---+------------+   |
+---------+       |     |              |   ^                |
                  |     |              |   |                |
                  |     |              |   |                |
                  +-----------------------------------------+
                        |              |   |
                 +------v-------+<-----+   |
                 |              |          |
                 |   connect    | +--------+--------+
                 |              | | connect configs |
                 +--------------+ +-----------------+

The diagram above describes the flow of creating a connector and having it automatically imported into KSQL (including what was implemented in #3149).
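To make the ConnectPollingService half of that flow concrete, here is a minimal sketch of one polling iteration, assuming the JDBC-style case where a known connector is identified by its configured topic.prefix. The class and method names are hypothetical, not the actual KSQL implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class PollingIterationSketch {

  // One scheduled iteration: given the topic prefixes of connectors that
  // ConnectConfigService has registered, find the Kafka topics that a known
  // connector could have created. Each match would trigger a CREATE TABLE
  // request to the KSQL endpoint.
  static List<String> matchTopics(final Set<String> topicPrefixes, final List<String> kafkaTopics) {
    return kafkaTopics.stream()
        .filter(topic -> topicPrefixes.stream().anyMatch(topic::startsWith))
        .collect(Collectors.toList());
  }
}
```

For example, a connector configured with topic.prefix jdbc- would claim the topic jdbc-users but not an unrelated orders topic.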

Beyond that, the following classes were changed:

  • configs were added for (1) disabling this feature and (2) specifying the Kafka topic for connect-configs
  • the KsqlConnect class simply wraps the two classes described above into one easy-to-pass-around component
  • the Connector class models information specific to each connector (e.g. the topic.prefix config for the JDBC connector), and Connectors helps create those
  • CreateConfigs (WITH clause) now accepts metadata describing which connector created the source. This is not used as of this PR, but it was straightforward enough that removing it seemed more trouble than it was worth.

Distributed System Concerns

Since multiple servers will be running this at the same time, we make sure that only one is in charge by having them all share a group.id when reading from connect-configs. If a server becomes the one assigned to read from connect-configs, it will reconstruct the entire state by calling /connectors and reading data from connect.
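The shared-group.id trick can be sketched with standard Kafka consumer configs; the group id value below is a hypothetical illustration, not the actual config KSQL uses.

```java
import java.util.Properties;

final class ConnectConfigsConsumerProps {

  static Properties sharedGroupProps(final String bootstrapServers) {
    final Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    // Every KSQL server uses the SAME group.id, so Kafka's group coordinator
    // assigns the connect-configs partition(s) to only one consumer at a time.
    props.put("group.id", "ksql-connect-configs-reader");
    // Read from the beginning on assignment, so the newly assigned server can
    // reconstruct the entire state.
    props.put("auto.offset.reset", "earliest");
    return props;
  }
}
```

If the assigned server dies, the group coordinator reassigns the partition to another server in the same group, which then replays the topic to rebuild state.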

Security Concerns

cc @spena - since this is asynchronous, the KSQL principal will be the one who creates the table from the connect topic. Do you have any suggestions here with regards to the ksql security model?

Future Work

  • We need to add support for dropping these sources. As of this design, even if you DROP <SOURCE>, it will be re-created the next time ConnectPollingService runs.
  • Work needs to be done in connect to expose a metadata topic to replace the connect-configs topic. When we do that, ConnectPollingService can be removed.
  • We need to add DESCRIBE functionality to connectors, which will leverage the WITH clause change in this PR. This will be in the returned response for CREATE SOURCE CONNECTOR for improved usability.
  • We will add AWAIT <SOURCE> so that users can wait until a certain stream is imported into KSQL.
  • Full integration system tests for this, and documentation of the end-to-end connect integration.
  • Simplify JDBC configuration for SMTs.

Testing done

  • Unit tests mocking out most major components
  • End to end manual testing:
ksql> CREATE SOURCE CONNECTOR `jdbc-connector` WITH(
    "connector.class"='io.confluent.connect.jdbc.JdbcSourceConnector',
    "tasks.max"='1',
    "connection.url"='jdbc:postgresql://localhost:5432/almog.gavra',
    "mode"='bulk',
    "topic.prefix"='jdbc-',
    "transforms"='createKey,extractString',
    "transforms.createKey.type"='org.apache.kafka.connect.transforms.ValueToKey',
    "transforms.createKey.fields"='username',
    "transforms.extractString.type"='org.apache.kafka.connect.transforms.ExtractField$Key',
    "transforms.extractString.field"='username',
    "key.converter"='org.apache.kafka.connect.storage.StringConverter');

 Message
----------------------------------
 Created connector jdbc-connector
----------------------------------

ksql> SHOW TABLES;

 Table Name           | Kafka Topic | Format | Windowed
--------------------------------------------------------
 JDBC_CONNECTOR_USERS | jdbc-users  | AVRO   | false
--------------------------------------------------------

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra requested a review from a team as a code owner August 6, 2019 19:58
@hjafarpour (Contributor)

@agavra I have a couple of questions about the design before I start reviewing the PR.

  • It seems that we will consider all the existing connectors, regardless of whether they were created by KSQL, am I correct?

  • Same as above, it seems that we will import all of the connect-generated topics, regardless of whether the connector was created by KSQL, right?

  • How do we decide whether we should create a stream or a table for an imported topic? For instance, we can import a stream or a table using a JDBC connector.

@agavra (Contributor, Author) commented Aug 7, 2019

Thanks @hjafarpour - answers in line.

  • It seems that we will consider all the existing connectors, regardless of whether they were created by KSQL, am I correct?
  • Same as above, it seems that we will import all of the connect-generated topics, regardless of whether the connector was created by KSQL, right?

Correct for both of these. I don't think we should distinguish between what was created by KSQL and what was not.

  • How do we decide whether we should create a stream or a table for an imported topic? For instance, we can import a stream or a table using a JDBC connector.

I talked about this with @rmoff just earlier today. As it stands, we're going to make this decision per "blessed" connector (e.g. JDBC will always create a table, and Kinesis will always create a stream). I think this is an acceptable first step, as long as we allow users to re-import things as the other type if we made a mistake. In the future, I think we should bleed the stream/table duality into connect and allow the connectors to specify what it is that they are creating.
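The per-"blessed"-connector decision could look something like the sketch below. The JDBC class name is the real connector class from the testing section; the Kinesis entry is a hypothetical placeholder, since no Kinesis class name appears in this PR.

```java
import java.util.Map;

final class BlessedConnectorTypes {

  enum SourceType { TABLE, STREAM }

  // Each "blessed" connector class maps to a fixed source type.
  private static final Map<String, SourceType> BLESSED = Map.of(
      "io.confluent.connect.jdbc.JdbcSourceConnector", SourceType.TABLE,
      "hypothetical.KinesisSourceConnector", SourceType.STREAM);

  // Returns null for connectors that are not (yet) blessed, which would
  // simply be ignored by the polling service.
  static SourceType decide(final String connectorClass) {
    return BLESSED.get(connectorClass);
  }
}
```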

@hjafarpour (Contributor)

Hmm, I think we indeed should distinguish between KSQL-generated connectors and other ones. Otherwise, we would have significant complexity in managing KSQL-generated connectors. For instance, should we be able to drop a connector that was created outside KSQL?
Importing topics from other connectors may also cause security-related issues. For instance, the KSQL server may not have access to such topics.

@agavra (Contributor, Author) commented Aug 7, 2019

Hmm, I think we indeed should distinguish between KSQL-generated connectors and other ones. Otherwise, we would have significant complexity in managing KSQL-generated connectors. For instance, should we be able to drop a connector that was created outside KSQL?
Importing topics from other connectors may also cause security-related issues. For instance, the KSQL server may not have access to such topics.

Talked offline so I'm summarizing discussion here. There are two different concerns baked into that:

  • Security
  • UI Bloat (importing too much data)

I think the Security aspect should be handled transparently by some RBAC-like system and shouldn't impact the design choices here. Viewing, creating and dropping connectors should all go through the principal of the client issuing the command.

UI bloat is another issue: most other systems (e.g. Hive, Presto, etc.) require each script to declare what it wants to use, and do not make everything immediately usable. The approach suggested in this PR could cause too many streams/tables to show up when you SHOW TABLES. With regard to this, it may be valuable to store metadata in the metastore recording which connectors are created by KSQL and which are not, and then filter on that set in the ConnectPollingService.

Since this feature is additive, we can implement it in a follow-up PR.

@hjafarpour (Contributor) left a comment:

Two more questions before LGTM! :)

* {@link ConnectPollingService} to digest and register with KSQL.
*
* <p>On startup, this service reads the connect configuration topic from the
* beginning to make sure that it reconstructs the necessary state.</p>
Contributor:

What happens when connectors are terminated? Does Connect write such information to this topic? How do we deal with such scenarios where external systems create and terminate connectors?

Contributor Author:

Good questions! Answers inline:

What happens when connectors are terminated? Does Connect write such information to this topic?

Nothing happens, because connect doesn't write any such information to the topic. More importantly, it doesn't delete the topics that it already created, so I'm not sure what the expected behavior would be.

In the future when we support SHOW CONNECTORS, they would not show up.

How do we deal with such scenarios where external systems create and terminate connectors?

The beautiful thing about this design is that everything is decoupled, so whether the connector is created internally or externally, nothing changes!

topic, connector.getName(), source);
final Builder<String, Literal> builder = ImmutableMap.<String, Literal>builder()
.put(CommonCreateConfigs.KAFKA_TOPIC_NAME_PROPERTY, new StringLiteral(topic))
.put(CommonCreateConfigs.VALUE_FORMAT_PROPERTY, new StringLiteral("AVRO"))
Contributor:

Do we always import data in AVRO format? What if we want other formats?

Contributor Author:

For now, yes, since Schema Registry only supports AVRO. When it supports other types, hopefully there will be some API that allows us to get what format it is and inject it here.

@agavra agavra requested a review from a team August 7, 2019 22:18
@hjafarpour (Contributor) left a comment:

LGTM!
As we discussed after the MVP we can revisit the design decisions.

@agavra agavra requested a review from a team August 7, 2019 23:06
@rodesai (Contributor) left a comment:

Thanks, @agavra! Left a bunch of comments inline.

final String name = connector.getName();
final String source = connector.mapToSource(topic).toUpperCase();

// if the meta store already contains the source, don't send the extra command
Contributor:

Should we be looking for sources that use the topic rather than the exact source name the connector would use? E.g., if a user creates a stream and topic and then starts a connector, should we automatically create another stream? (I honestly don't know the answer here.)

Contributor Author:

That might be a way to go about it, but it requires adding an index in the Metastore. For now, I just want to make sure I'm not spamming the command topic with unnecessary commands, and this serves that purpose. We can always change it later; it would be mostly backwards compatible.
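The duplicate-avoidance guard being discussed can be sketched as a simple existence check; the class and method names here are illustrative rather than the actual KSQL code.

```java
import java.util.Set;

final class DuplicateSourceGuard {

  // Only issue a CREATE TABLE command if the meta store does not already
  // contain the source name. Source names are compared upper-cased, matching
  // the mapToSource(topic).toUpperCase() call quoted above.
  static boolean shouldIssueCreate(final Set<String> existingSources, final String sourceName) {
    return !existingSources.contains(sourceName.toUpperCase());
  }
}
```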


@Override
protected Scheduler scheduler() {
return Scheduler.newFixedRateSchedule(0, INTERVAL_S, TimeUnit.SECONDS);
Contributor:

One potential improvement could be to poll with a backoff in such a way that if we add a new connector we can poll with a short delay and back off until we hit the steady state interval. Out of scope for this change though.
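The backoff idea could be as small as the following sketch: poll with a short delay right after a new connector is registered, then back off toward the steady-state interval. The doubling factor and the reset-on-new-connector policy are illustrative choices, not anything specified in this PR.

```java
final class PollBackoff {

  // Next delay doubles each round, capped at the steady-state interval.
  static long nextDelaySeconds(final long currentDelaySeconds, final long steadyStateSeconds) {
    return Math.min(currentDelaySeconds * 2, steadyStateSeconds);
  }

  // When a new connector is registered, a caller would reset the delay to a
  // short value (e.g. 1 second) and let it grow back via nextDelaySeconds.
  static long resetDelaySeconds() {
    return 1L;
  }
}
```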

Contributor Author:

Good suggestion. I was thinking of adding a whole suite of things to improve this (I might change it from a scheduled service to something more custom that wakes up either every N seconds or when we add a connector, something like a blocking queue).

}
}

private static Connector jdbc(final Map<String, String> properties) {
Contributor:

I think this fits better in its own class; otherwise this will get out of hand as we add more connectors.

Contributor Author:

Let's handle this problem as it comes. I was considering the best way to do it, and I'm not sure that having a class per Connector will make it any cleaner; in fact, I think it might make the amount of boilerplate annoying to refactor. I may be wrong, but refactoring in the future is easy :)

private static <T> ResponseHandler<ConnectResponse<T>> createHandler(
@SuppressWarnings("unchecked")
@Override
public ConnectResponse<List<String>> connectors() {
Contributor:

We should bake in some retries here for network/5xx errors. Ditto below.
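A retry wrapper of the kind being suggested could look like this sketch; it is illustrative, not the actual KSQL connect client code, and assumes transient failures surface as RuntimeExceptions.

```java
import java.util.function.Supplier;

final class SimpleRetry {

  // Retry the call up to maxAttempts times, rethrowing the last failure if
  // every attempt fails. Assumes maxAttempts >= 1.
  static <T> T withRetries(final Supplier<T> call, final int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return call.get();
      } catch (RuntimeException e) { // e.g. a network error or a mapped 5xx
        last = e;
      }
    }
    throw last; // all attempts exhausted
  }
}
```

A production version would also add a delay (or backoff) between attempts and only retry errors known to be transient.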

Contributor Author:

I will do this in a follow-up PR

@rodesai (Contributor) left a comment:

LGTM!

@agavra agavra merged commit 6dd21fd into confluentinc:master Aug 12, 2019
@agavra agavra deleted the connect_poller branch August 12, 2019 22:35