Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][io] Update Elasticsearch sink idle cnx timeout to 30s #19377

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Jan 31, 2023

Motivation

The current Elasticsearch sink has a setting named connectionIdleTimeoutInMs. This setting closes expired and idle http connections opened by the Elasticsearch Rest Client. The current default is 5 milliseconds. With this value, I observed connection issues both for timeouts trying to connect to elasticsearch and errors like "connection reset by peer". When overriding the default to 5 seconds, all errors disappeared. I propose we change it to 30 seconds since even 5 seconds is a little low.

Modifications

    /**
     * Closes idle connections in the pool.
     * <p>
     * Open connections in the pool that have not been used for the
     * timespan given by the argument will be closed.
     * Currently allocated connections are not subject to this method.
     * Times will be checked with milliseconds precision
     *
     * All expired connections will also be closed.
     *
     * @param idletime  the idle time of connections to be closed
     * @param tunit     the unit for the {@code idletime}
     *
     * @see #closeExpiredConnections()
     */
    void closeIdleConnections(long idletime, TimeUnit tunit);

Verifying this change

This is a trivial change.

Does this pull request potentially affect one of the following parts:

  • The default values of configurations

This PR changes a default value for a sink. The current default is problematic.

Documentation

  • doc-not-needed

We generate the docs for this sink, so we don't need to update any site docs. We should include this update on the release notes.

Matching PR in forked repository

PR in forked repository: michaeljmarshall#22

@lhotari
Copy link
Member

lhotari commented Jan 31, 2023

/pulsarbot rerun-failure-checks

)
private int connectionIdleTimeoutInMs = 5;
private int connectionIdleTimeoutInMs = 30000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this config should be validated against bulkFlushIntervalInMs when bulk API is enabled - something like connectionIdleTimeoutInMs > 2 * bulkFlushIntervalInMs because it seems the connection will set idle by design in-between flushes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be fine to let the connection get closed in that scenario. My main goal here is to make sure we have working defaults.

Copy link
Contributor

@aymkhalil aymkhalil Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand both comments are probably outside scope of this PR: First step is to have working defaults, and later maybe make them foolproof...

@@ -83,7 +83,6 @@ public RestClient(ElasticSearchConfig elasticSearchConfig, BulkProcessor.Listene
// idle+expired connection evictor thread
this.executorService = Executors.newSingleThreadScheduledExecutor();
this.executorService.scheduleAtFixedRate(() -> {
configCallback.connectionManager.closeExpiredConnections();
configCallback.connectionManager.closeIdleConnections(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Is it required at all to evict idle connections? I wonder what's wrong with long lived connection that has a life cycle coupled with that of the sink instance. If it is not required, we could drop the connectionIdleTimeoutInMs for good but I maybe missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The motivation of this PR assumes that closing these connections is necessary, however, I am not sure that it is. The fundamental risk is that something between the client and the server closes the connection. In my mind, the canonical example is a network load balancer with a 4 or 5 minute timeout.

Closing expired and idle connections is one solution to prevent such errors due to inactivity.

While troubleshooting the underlying behavior this PR aims to fix, I came across elastic/elasticsearch#65213, which indicates that an alternative solution is to enable socket keepalives and to decrease the net.ipv4.tcp_keepalive_time in order to make sure those keepalives are sent before any intermediate server closes the connection due to inactivity. Since that solution requires modifying OS settings, I think this solution might be easier to maintain, even though it'll be less efficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-reading that elasticsearch issue, it could be reasonable to move in the direction of enabling tcp keep-alives. At the very least, I think we should merge this and fix the existing default values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since that solution requires modifying OS settings

@michaeljmarshall On managed cloud k8s environments, the OS settings are already properly tuned. related comment: #14841 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari - that was not the case in the AKS cluster that I was testing with as of yesterday. When I tried to override the settings using https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#setting-sysctls-for-a-pod, I got an error because overriding net.ipv4.tcp_keepalive_time = 300 is considered "unsafe" by default.

@lhotari lhotari merged commit 1481c74 into apache:master Jan 31, 2023
michaeljmarshall added a commit that referenced this pull request Jan 31, 2023
michaeljmarshall added a commit that referenced this pull request Jan 31, 2023
michaeljmarshall added a commit that referenced this pull request Jan 31, 2023
(cherry picked from commit 1481c74)
(cherry picked from commit fd700da)
@michaeljmarshall michaeljmarshall deleted the fix-bad-elasticsearch-http-client-default branch February 1, 2023 06:56
michaeljmarshall added a commit to datastax/pulsar that referenced this pull request Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants