
[FLINK-35894] Add Elasticsearch Sink Connector for Flink CDC Pipeline #3495

Merged: 2 commits merged into apache:master on Aug 12, 2024

Conversation

@proletarians (Contributor)

This commit introduces the Elasticsearch Sink Connector for Flink CDC Pipeline. It includes:

  • Configuration options for Elasticsearch sink
  • Serialization logic for Elasticsearch events
  • Data type conversion utilities
  • Elasticsearch sink implementation
  • Factory for creating Elasticsearch data sinks

These changes enable Flink CDC to efficiently stream data changes to Elasticsearch.

<parent>
<groupId>org.apache.flink</groupId>
<artifactId>flink-cdc-pipeline-connectors</artifactId>
<version>3.2-SNAPSHOT</version>

Please use <version>${revision}</version> like the other connectors.

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-cdc-composer</artifactId>
<version>3.2-SNAPSHOT</version>

Use <version>${project.version}</version>.

import java.io.Serializable;
import java.util.List;

/** DorisDataSink Options reference {@link ElasticsearchSinkOptions}. */

This should say ElasticsearchDataSink, not DorisDataSink.

<jackson.version>2.13.2</jackson.version>
<surefire.module.config>--add-opens=java.base/java.util=ALL-UNNAMED</surefire.module.config>
<testcontainers.version>1.16.0</testcontainers.version>
</properties>

Some of these properties are already provided, so this can be simplified to:

    <properties>
        <elasticsearch.version>8.12.1</elasticsearch.version>
    </properties>


private ElasticsearchSinkOptions buildSinkConnectorOptions(Configuration cdcConfig) {
List<HttpHost> hosts = parseHosts(cdcConfig.get(HOSTS));
NetworkConfig networkConfig = new NetworkConfig(hosts, null, null, null, null, null);

It seems that the username and password are not passed through.
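
For illustration, a hedged sketch of threading the credentials through; the NetworkConfig parameter positions for username and password are assumptions, not confirmed from this PR:

    // Hypothetical: assumes USERNAME/PASSWORD options exist and that NetworkConfig
    // takes the username and password as its second and third constructor arguments.
    String username = cdcConfig.get(USERNAME);
    String password = cdcConfig.get(PASSWORD);
    NetworkConfig networkConfig =
            new NetworkConfig(hosts, username, password, null, null, null);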

appender.testlogger.type = CONSOLE
appender.testlogger.target = SYSTEM_ERR
appender.testlogger.layout.type = PatternLayout
appender.testlogger.layout.pattern = %-4r [%t] %-5p %c - %m%n

Please add an org.apache.flink.cdc.common.factories.Factory file for SPI, like the other connectors.
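
For reference, SPI registration is just a resource file listing the factory implementation; the exact factory class name below is an assumption about this connector's package layout:

    # src/main/resources/META-INF/services/org.apache.flink.cdc.common.factories.Factory
    org.apache.flink.cdc.connectors.elasticsearch.factory.ElasticsearchDataSinkFactory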

* A class representing a record with multiple fields of various types. Provides methods to access
* fields by position and type.
*/
public class RecordData {

I don't see this class used anywhere; is it necessary?

}
Schema updatedSchema =
SchemaUtils.applySchemaChangeEvent(schemaMaps.get(tableId), schemaChangeEvent);
schemaMaps.put(tableId, updatedSchema);

If a column name is modified upstream, do we need to call the createSchemaIndexOperation method?


runJobWithEvents(events);

verifyInsertedData(tableId, "2", 2, 2.0, "value2");

Please add a test for processing all supported data types.

}

private void verifyInsertedData(
TableId tableId, String id, int expectedId, double expectedNumber, String expectedName)

We could pass a List<String> of column names and a List<Object> of expected values to simplify verification, as in the sketch below.
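
For illustration, a minimal sketch of the suggested helper; getDocumentSource is a hypothetical lookup that returns the indexed document as a Map:

    private void verifyInsertedData(
            TableId tableId, String id, List<String> columnNames, List<Object> expectedValues)
            throws Exception {
        // Hypothetical helper that fetches the document source for the given id.
        Map<String, Object> source = getDocumentSource(tableId, id);
        for (int i = 0; i < columnNames.size(); i++) {
            assertThat(source.get(columnNames.get(i))).isEqualTo(expectedValues.get(i));
        }
    }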

@lvyanquan (Contributor)

Thanks @proletarians for this contribution; I left some comments.

*
* @param <InputT> type of Operations
*/
public class Elasticsearch8AsyncWriter<InputT> extends AsyncSinkWriter<InputT, Operation> {

Do we support writing to ES 7 using this writer?

case CHAR:
case VARCHAR:
return recordData.getString(index);
default:
@lvyanquan (Jul 26, 2024)

Looks like the Decimal/Timestamp/Date types were missed; do you plan to support them?
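
For illustration, a hedged sketch of the missing branches; precision and scale are assumed to come from the column's DataType:

    case DATE:
        // DATE is stored as an int counting days since the epoch.
        return recordData.getInt(index);
    case TIMESTAMP_WITHOUT_TIME_ZONE:
        return recordData.getTimestamp(index, precision).toLocalDateTime().toString();
    case DECIMAL:
        return recordData.getDecimal(index, precision, scale).toBigDecimal();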

for (int i = 0; i < recordData.getArity(); i++) {
Column column = columns.get(i);
ColumnType columnType = ColumnType.valueOf(column.getType().getTypeRoot().name());
ElasticsearchRowConverter.SerializationConverter converter =

Can we cache the list of converters per TableId to avoid recreating converters every time we encounter a TableId we have already seen?
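
For illustration, a hedged sketch of such a cache; createConverter stands in for however the converter is actually built in this PR:

    // Build the converter list once per TableId and reuse it afterwards.
    private final Map<TableId, List<ElasticsearchRowConverter.SerializationConverter>>
            converterCache = new HashMap<>();

    private List<ElasticsearchRowConverter.SerializationConverter> getOrCreateConverters(
            TableId tableId, List<Column> columns) {
        return converterCache.computeIfAbsent(
                tableId,
                id ->
                        columns.stream()
                                .map(column -> createConverter(column.getType())) // hypothetical factory
                                .collect(Collectors.toList()));
    }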

schema.getColumns().stream()
.map(Column::asSummaryString)
.collect(Collectors.toList()));
schemaMap.put("primaryKeys", schema.primaryKeys());

Where did these keys come from? Are they something we customized?

ConfigOptions.key("index")
.stringType()
.noDefaultValue()
.withDescription("The Elasticsearch index name to write to.");

Actually, I don't see any place that uses this option.
What about writing all tables to this index when the config is set, and otherwise using the tableId as the index name, like the topic option in the Kafka connector?
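
For illustration, a minimal sketch of the suggested fallback; the option and accessor names are assumptions:

    // Use the configured index for all tables if set, otherwise fall back to the
    // table id, mirroring the topic option of the Kafka connector.
    String indexName = config.getOptional(INDEX).orElse(tableId.toString());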

.build();
}

private BulkOperationVariant applyDataChangeEvent(DataChangeEvent event)

It would be better to name this method createBulkOperationVariant.

"Primary key column not found: "
+ primaryKey));
int index = schema.getColumns().indexOf(column);
return getFieldValue(recordData, column.getType(), index);

This can be replaced with:

converterCache.get(TableId.tableId("your_tableId")).get(index).serialize(index, recordData);

Here you could pass the TableId associated with this RecordData.

if (schemaChangeEvent instanceof CreateTableEvent) {
Schema schema = ((CreateTableEvent) schemaChangeEvent).getSchema();
schemaMaps.put(tableId, schema);
return createSchemaIndexOperation(tableId, schema);

We should move this logic to ElasticsearchMetadataApplier, since more than one subtask may receive the SchemaChangeEvent.

new ArrayList<>();
for (Column column : schema.getColumns()) {
ColumnType columnType =
ColumnType.valueOf(column.getType().getTypeRoot().name());

This conversion will lead to precision loss, and it's unnecessary.

// Decimal type
case DECIMAL:
return (pos, data) -> {
DecimalData decimalData = data.getDecimal(pos, 17, 2);

We should get the precision and scale from the original data type.
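
For illustration, a hedged sketch that reads precision and scale from the column's DecimalType instead of hard-coding 17 and 2; the surrounding switch and the type variable are assumed:

    case DECIMAL:
        DecimalType decimalType = (DecimalType) type; // 'type' is the column's data type
        int precision = decimalType.getPrecision();
        int scale = decimalType.getScale();
        return (pos, data) -> data.getDecimal(pos, precision, scale).toBigDecimal();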

private final long maxTimeInBufferMS;
private final long maxRecordSizeInBytes;
private final NetworkConfig networkConfig;
private final int version; // 新增字段

Please avoid using Chinese in comments.


These comments still exist.

<slf4j.version>1.7.32</slf4j.version>
<junit.platform.version>1.10.2</junit.platform.version>
<paimon.version>0.7.0-incubating</paimon.version>
<hadoop.version>2.8.5</hadoop.version>

Unrelated version properties can be removed.

@lvyanquan (Aug 11, 2024)

        <paimon.version>0.7.0-incubating</paimon.version>
        <hadoop.version>2.8.5</hadoop.version>
        <hive.version>2.3.9</hive.version>

These dependencies are unnecessary and can be removed.

* various data types that can be used in database columns and are relevant for serialization and
* deserialization processes.
*/
public enum ColumnType {

This class is now unused and can be removed.

}
} catch (Exception e) {
// Handle the exception as needed, e.g., log the error
System.err.println("Failed to deserialize Operation: " + e.getMessage());

It's better to use a Logger for this.
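
For illustration, a minimal sketch using an SLF4J logger; the enclosing class name is assumed:

    private static final Logger LOG = LoggerFactory.getLogger(OperationSerializer.class);

    // in the catch block:
    } catch (Exception e) {
        LOG.error("Failed to deserialize Operation", e);
    }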

assertThat(response.getSource().get("extra_bool")).isEqualTo(expectedExtraBool);
}

private List<Event> createTestEvents(TableId tableId) {

Looks like createTestEvents/createTestEventsWithDelete/createTestEventsWithAddColumn are common methods across Elasticsearch6DataSinkITCaseTest/Elasticsearch7DataSinkITCaseTest/ElasticsearchDataSinkITCaseTest; we could extract them into a utility class or a parent class.

private final long maxTimeInBufferMS;
private final long maxRecordSizeInBytes;
private final NetworkConfig networkConfig;
private final int version; // 新增字段

These comments still exist.

import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.sink.FlinkSinkProvider;
import org.apache.flink.cdc.connectors.elasticsearch.config.ElasticsearchSinkOptions;
import org.apache.flink.cdc.connectors.elasticsearch.sink.utils.ElasticsearchTestUtils;

This class was not included in the PR.

new IllegalStateException(
"Primary key column not found: "
+ primaryKey));
int index = schema.getColumns().indexOf(column);

It's better to use a for loop to avoid traversing twice.
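
For illustration, a hedged sketch of a single pass over the columns; variable names follow the snippet above:

    // Find the primary key column and its index in one traversal.
    int index = -1;
    Column column = null;
    List<Column> columns = schema.getColumns();
    for (int i = 0; i < columns.size(); i++) {
        if (columns.get(i).getName().equals(primaryKey)) {
            column = columns.get(i);
            index = i;
            break;
        }
    }
    if (column == null) {
        throw new IllegalStateException("Primary key column not found: " + primaryKey);
    }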

@lvyanquan left a comment:

Thanks @proletarians for this contribution, LGTM.

@lvyanquan (Contributor)

Hi @leonardBang, could you please help trigger the CI check?

@yuxiqian commented (Aug 12, 2024)

Hi @proletarians,

The license check is failing because it detects some questionable licenses. It seems the jakarta.json/jakarta.json-api and org.eclipse.parsson/parsson libraries are released under the EPL 2.0 license (with an alternative GPL-2.0 license), which is marked as "Category B" and needs a manual check to verify compatibility with ASF policies [1].

After quickly checking the licenses declared by jakarta.json/jakarta.json-api and org.eclipse.parsson/parsson, I believe they can be legally packaged into the binary release.

You may apply the following patch to suppress the false alarm:

--- a/tools/ci/license_check.rb	(revision 44ea2a73d662b6adaae1c79beb3f1ab3e37c6278)
+++ b/tools/ci/license_check.rb	(date 1723425566467)
@@ -75,6 +75,9 @@
   'org.glassfish.jersey', # dual-licensed under GPL 2 and EPL 2.0
   'org.glassfish.hk2', # dual-licensed under GPL 2 and EPL 2.0
   'javax.ws.rs-api', # dual-licensed under GPL 2 and EPL 2.0
+  'jakarta.json-api', # dual-licensed under GPL 2 and EPL 2.0
+  'org.eclipse.parsson', # EPL 2.0
+  'org/eclipse/parsson', # EPL 2.0
   'jakarta.ws.rs' # dual-licensed under GPL 2 and EPL 2.0
 ].freeze
 
@@ -110,7 +113,7 @@
   Zip::File.open(jar_file) do |jar|
     jar.filter { |e| e.ftype == :file }
        .filter { |e| !File.basename(e.name).downcase.end_with?(*BINARY_FILE_EXTENSIONS) }
-       .filter { |e| !File.basename(e.name).downcase.start_with? 'license', 'dependencies' }
+       .filter { |e| !File.basename(e.name).downcase.start_with? 'license', 'dependencies', 'notice' }
        .filter { |e| EXCEPTION_PACKAGES.none? { |ex| e.name.include? ex } }
        .map do |e|
          content = e.get_input_stream.read.force_encoding('UTF-8')

[1] https://www.apache.org/legal/resolved.html

@yuxiqian left a comment:

Thanks for @proletarians' great work, I just left some comments.


Why explicitly specify a much older version of ryuk here?


Minor: please also declare this new connector in .github/labeler.yml so PRs are labeled correctly.


private static final Logger LOG =
LoggerFactory.getLogger(ElasticsearchDataSinkITCaseTest.class);
private static final String ELASTICSEARCH_VERSION = "7.10.2";

Why does Elasticsearch6DataSinkITCaseTest run on ES 7.10.2?

@github-actions bot added the build label on Aug 12, 2024
@yuxiqian left a comment:

Thanks for @proletarians' rapid response!

@leonardBang (Contributor)

@proletarians Could you check the failed CI?

@leonardBang left a comment:

Thanks @proletarians for the nice work and @lvyanquan and @yuxiqian for the review work, +1

@leonardBang leonardBang merged commit 8137f9d into apache:master Aug 12, 2024
2 checks passed
qiaozongmi pushed a commit to qiaozongmi/flink-cdc that referenced this pull request Sep 23, 2024