[fix][io] Update Elasticsearch sink idle cnx timeout to 30s #19377

michaeljmarshall · 2023-01-31T17:49:25Z

Motivation

The Elasticsearch sink has a setting named connectionIdleTimeoutInMs, which controls when expired and idle HTTP connections opened by the Elasticsearch Rest Client are closed. The current default is 5 milliseconds. With this value, I observed connection issues: timeouts when trying to connect to Elasticsearch and errors like "connection reset by peer". When I overrode the default to 5 seconds, all errors disappeared. I propose we change it to 30 seconds, since even 5 seconds is a little low.

Modifications

Change the default value of connectionIdleTimeoutInMs from 5 ms to 30000 ms.
Remove the call to closeExpiredConnections. It is unnecessary because closeIdleConnections also closes expired connections. Judging from the source code, each of these method calls iterates over the connections, so relying on closeIdleConnections alone lets us iterate over the connections once instead of twice. Source: https://hc.apache.org/httpcomponents-asyncclient-4.1.x/current/httpasyncclient/apidocs/org/apache/http/nio/conn/NHttpClientConnectionManager.html#closeIdleConnections(long,%20java.util.concurrent.TimeUnit)

    /**
     * Closes idle connections in the pool.
     * <p>
     * Open connections in the pool that have not been used for the
     * timespan given by the argument will be closed.
     * Currently allocated connections are not subject to this method.
     * Times will be checked with milliseconds precision
     *
     * All expired connections will also be closed.
     *
     * @param idletime  the idle time of connections to be closed
     * @param tunit     the unit for the {@code idletime}
     *
     * @see #closeExpiredConnections()
     */
    void closeIdleConnections(long idletime, TimeUnit tunit);
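To illustrate the second modification, here is a minimal sketch of an evictor that relies on a single closeIdleConnections call per run. IdleConnectionCloser is a hypothetical stand-in for the one connection manager method used here, and the class name, method names, and the scheduling period (set equal to the idle timeout) are assumptions for illustration, not the sink's actual code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class IdleEvictorSketch {

    /** Hypothetical stand-in for the single connection manager method this sketch needs. */
    interface IdleConnectionCloser {
        void closeIdleConnections(long idleTime, TimeUnit unit);
    }

    /**
     * Builds the eviction task. One call is enough because, per the Javadoc
     * quoted above, closeIdleConnections also closes expired connections.
     */
    static Runnable evictionTask(IdleConnectionCloser manager, long idleTimeoutMs) {
        return () -> manager.closeIdleConnections(idleTimeoutMs, TimeUnit.MILLISECONDS);
    }

    /**
     * Runs the task at a fixed rate on a single thread.
     * Assumption: the period equals the idle timeout; the real period may differ.
     */
    static ScheduledExecutorService startEvictor(IdleConnectionCloser manager, long idleTimeoutMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(evictionTask(manager, idleTimeoutMs),
                idleTimeoutMs, idleTimeoutMs, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

The caller is expected to shut the returned executor down when the client is closed.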

Verifying this change

This is a trivial change.

Does this pull request potentially affect one of the following parts:

The default values of configurations

This PR changes a default value for a sink. The current default is problematic.

Documentation

doc-not-needed

We generate the docs for this sink, so we don't need to update any site docs. We should include this update in the release notes.

Matching PR in forked repository

PR in forked repository: michaeljmarshall#22

lhotari · 2023-01-31T18:27:56Z

/pulsarbot rerun-failure-checks

aymkhalil · 2023-01-31T18:27:42Z

...-io/elastic-search/src/main/java/org/apache/pulsar/io/elasticsearch/ElasticSearchConfig.java

    )
-    private int connectionIdleTimeoutInMs = 5;
+    private int connectionIdleTimeoutInMs = 30000;


I wonder if this config should be validated against bulkFlushIntervalInMs when the bulk API is enabled, something like connectionIdleTimeoutInMs > 2 * bulkFlushIntervalInMs, because it seems the connection will sit idle by design in between flushes.

I think it'd be fine to let the connection get closed in that scenario. My main goal here is to make sure we have working defaults.

Yeah, I understand both comments are probably outside the scope of this PR: the first step is to have working defaults, and later maybe make them foolproof...
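For reference, the validation suggested in this thread could look roughly like the sketch below. It is not part of this PR. The class and method names are hypothetical, and treating a non-positive bulkFlushIntervalInMs as "no periodic flush" is an assumption.

```java
class IdleTimeoutValidationSketch {

    /**
     * Hypothetical check following the review suggestion: when bulk is enabled,
     * require connectionIdleTimeoutInMs > 2 * bulkFlushIntervalInMs so that a
     * connection is not evicted while it waits for the next flush.
     */
    static void validate(boolean bulkEnabled, long connectionIdleTimeoutInMs, long bulkFlushIntervalInMs) {
        if (!bulkEnabled) {
            return; // the rule only applies to bulk mode
        }
        if (bulkFlushIntervalInMs <= 0) {
            return; // assumption: non-positive means there is no periodic flush to compare against
        }
        long minimum = Math.multiplyExact(2L, bulkFlushIntervalInMs);
        if (connectionIdleTimeoutInMs <= minimum) {
            throw new IllegalArgumentException(
                    "connectionIdleTimeoutInMs (" + connectionIdleTimeoutInMs
                            + ") must be greater than 2 * bulkFlushIntervalInMs (" + minimum + ")");
        }
    }
}
```

With the new 30000 ms default, a flush interval of 1000 ms would pass this check, while the old 5 ms default would be rejected.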

aymkhalil · 2023-01-31T18:31:21Z

...ar-io/elastic-search/src/main/java/org/apache/pulsar/io/elasticsearch/client/RestClient.java

@@ -83,7 +83,6 @@ public RestClient(ElasticSearchConfig elasticSearchConfig, BulkProcessor.Listene
        // idle+expired connection evictor thread
        this.executorService = Executors.newSingleThreadScheduledExecutor();
        this.executorService.scheduleAtFixedRate(() -> {
-                    configCallback.connectionManager.closeExpiredConnections();
                    configCallback.connectionManager.closeIdleConnections(


Q: Is it required at all to evict idle connections? I wonder what's wrong with a long-lived connection whose life cycle is coupled with that of the sink instance. If it is not required, we could drop connectionIdleTimeoutInMs for good, but I may be missing something.

Good question. The motivation of this PR assumes that closing these connections is necessary; however, I am not sure that it is. The fundamental risk is that something between the client and the server closes the connection. In my mind, the canonical example is a network load balancer with a 4 or 5 minute timeout.

Closing expired and idle connections is one solution to prevent such errors due to inactivity.

While troubleshooting the underlying behavior this PR aims to fix, I came across elastic/elasticsearch#65213, which indicates that an alternative solution is to enable socket keepalives and to decrease net.ipv4.tcp_keepalive_time so that those keepalives are sent before any intermediate server closes the connection due to inactivity. Since that solution requires modifying OS settings, I think the approach in this PR might be easier to maintain, even though it'll be less efficient.

After re-reading that elasticsearch issue, it could be reasonable to move in the direction of enabling tcp keep-alives. At the very least, I think we should merge this and fix the existing default values.

"Since that solution requires modifying OS settings"

@michaeljmarshall On managed cloud k8s environments, the OS settings are already properly tuned. related comment: #14841 (comment)

@lhotari - that was not the case in the AKS cluster that I was testing with as of yesterday. When I tried to override the settings using https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#setting-sysctls-for-a-pod, I got an error because overriding net.ipv4.tcp_keepalive_time = 300 is considered "unsafe" by default.
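As a point of reference for the keepalive alternative discussed in this thread, the sketch below turns on SO_KEEPALIVE for a plain java.net.Socket. It only illustrates the socket option itself: KeepAliveSketch and enableKeepAlive are hypothetical names, and how the Elasticsearch Rest Client would expose this option is not shown. The time before the first probe is still governed by OS settings such as net.ipv4.tcp_keepalive_time, which is the part that needed a sysctl override above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

class KeepAliveSketch {

    /**
     * Enables TCP keepalive probes on the given socket.
     * The probe timing is not set here; it comes from the operating system.
     */
    static void enableKeepAlive(Socket socket) {
        try {
            socket.setKeepAlive(true);
        } catch (IOException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```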


michaeljmarshall added 2 commits January 31, 2023 11:34

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s

ca85ade

Remove unnecessary method call

7cc2ff9

michaeljmarshall added the type/bug, release/note-required, doc-not-needed, release/2.11.1, release/2.9.5, and release/2.10.4 labels Jan 31, 2023

michaeljmarshall added this to the 3.0.0 milestone Jan 31, 2023

michaeljmarshall requested review from eolivelli and nicoloboschi January 31, 2023 17:49

michaeljmarshall self-assigned this Jan 31, 2023

michaeljmarshall requested review from lhotari, freeznet and nlu90 January 31, 2023 17:52

nicoloboschi approved these changes Jan 31, 2023


lhotari approved these changes Jan 31, 2023


aymkhalil reviewed Jan 31, 2023


Improve updated field description

98da181

aymkhalil approved these changes Jan 31, 2023


lhotari merged commit 1481c74 into apache:master Jan 31, 2023

michaeljmarshall added a commit that referenced this pull request Jan 31, 2023

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s (#19377)

dbe1c0a

(cherry picked from commit 1481c74)

michaeljmarshall added a commit that referenced this pull request Jan 31, 2023

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s (#19377)

fd700da

(cherry picked from commit 1481c74)

michaeljmarshall added a commit that referenced this pull request Jan 31, 2023

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s (#19377)

5265cb8

(cherry picked from commit 1481c74) (cherry picked from commit fd700da)

michaeljmarshall added the cherry-picked/branch-2.9, cherry-picked/branch-2.10, and cherry-picked/branch-2.11 labels Jan 31, 2023

michaeljmarshall deleted the fix-bad-elasticsearch-http-client-default branch February 1, 2023 06:56

michaeljmarshall added a commit to datastax/pulsar that referenced this pull request Feb 8, 2023

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s (apache#1…

4579052

…9377) (cherry picked from commit 1481c74)
