Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node.js connection idle sockets timeout rework, logger improvements #242

Merged
merged 3 commits into from
Mar 13, 2024

Conversation

slvrtrn
Copy link
Contributor

@slvrtrn slvrtrn commented Mar 13, 2024

Summary

Initially, this PR was planned for 1.0.0, but it will be safer to release this as 0.3.0-beta.1 in advance to test this particular changeset.

Resolves #237
Resolves #196 (it's not that KeepAlive was degrading performance; it was that KeepAlive=false was still using the entire KeepAlive mechanism).

Changes (some are breaking):

  • Idle sockets timeout rework; now, the client tracks the idling sockets (on top of what the KeepAlive agent should normally do) and forcefully destroys them when it sees fit (by default, after 2500ms of idling, to match the default ClickHouse server KeepAlive configuration).
  • Ensure proper Connection header value considering KeepAlive settings. If KeepAlive is disabled, its value is now forced to "close".
  • (Breaking) keep_alive.retry_on_expired_socket and keep_alive.socket_ttl configuration parameters are removed. Instead, the new keep_alive.idle_socket_ttl parameter should be used. The default value for keep_alive.idle_socket_ttl is 2500.
  • Fixed a bug with Ping that could lead to unhandled Socket hang-up propagation.
  • Added an extra layer of "protection" against potential socket issues to the query/select/exec methods in the underlying HTTP connection implementation. If a request fails, the client will try to ensure that the outgoing request is also destroyed, meaning that the socket will be released immediately; this socket also shouldn't produce any unexpected "results" later (such as, again, unhandled Socket hang up from a rogue socket).
  • (Technically breaking) The max_open_connections configuration parameter is now 10 by default, as we should not rely on the KeepAlive agent's defaults.
  • (from 1.0.0, technically breaking) Backported the fix of the default request_timeout configuration value (now it is correctly set to 30s).
  • Logger and logging improvements - a bit fancier default console writer and more logs on failing requests (if the log level is not OFF in the client configuration (it is off by default)); all client methods except ping will log an error on failure now, and ping will log a warning (cause the error is returned as part of its result).
  • Split/fixed connection unit tests cause it already had too many test cases for a single file.
  • Fixed a minor "lint" in toURLSearchParams since query_id has not been optional within the connection for a long time and is present for any request that uses it.
  • From Add SharedMergeTree Cloud tests, remove Node 16 from the CI, add Node 21 #234 - removed Node.js 16 from the CI and bumped GHA versions.

A changelog entry will be added if we decide to proceed with this.

Checklist

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@slvrtrn slvrtrn requested a review from mshustov March 13, 2024 01:33
@slvrtrn slvrtrn mentioned this pull request Mar 13, 2024
@slvrtrn slvrtrn merged commit 2b03b46 into main Mar 13, 2024
23 checks passed
@slvrtrn slvrtrn deleted the free-idle-sockets-rework branch March 13, 2024 16:12
@@ -26,12 +26,12 @@ jobs:
strategy:
fail-fast: true
matrix:
node: [16, 18, 20]
node: [18, 20, 21]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21 is not LTS and might be unstable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth keeping if it does not bring any flakiness to the CI run. If it starts to do so, we will remove it.

@@ -51,6 +51,8 @@ export type ConnPingResult =
}
| { success: false; error: Error }

export type ConnOperation = 'Ping' | 'Query' | 'Insert' | 'Exec'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rather use the fullname ConnectorOperation or just Operation to avoid any abbreviations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rename all of these types in 1.0.0, then. They are all prefixed with just Conn.

this.logger.trace(params)
this.logger.trace({
...params,
module: params.module ?? this.module,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enhancement request: context-tree like in log4j

server.close()
})

// ping first, then 2 operations in all possible combinations - repeat every combination several times
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow that's something :D

expect().nothing()
})

it('should not throw unhandled errors with Select', async () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we test this. The test verifies select throws Timeout error. exception. How does just runner handles uncaught exceptions / promise rejections?
https://developer.mozilla.org/en-US/docs/Web/API/Window/unhandledrejection_event
https://nodejs.org/api/process.html#event-unhandledrejection

Copy link
Contributor Author

@slvrtrn slvrtrn Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test will still fail if something is thrown after the timeout error and it is unhandled (like something thrown from the Node's HTTP internals after we get the timeout error). This was the main problem with Ping, actually. That's also why there is a test of Ping + a few other operations in sequence.

This was referenced Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve sockets housekeeping mechanism Keep-Alive degrades performance
2 participants