Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ksql 5716/close transient queries when network splits #6905

Conversation

swist
Copy link
Member

@swist swist commented Jan 26, 2021

Description

We've observed a behaviour where if there are two concurrent PushQueries running against a KSQL cluster, upon terminating the websocket in the browser, one of the queries lingers.

Testing done

This is a right pain to test because everything useful gets mocked, so I'm more than happy to take pointers about testing this unit-test wise. I have tested that the behaviour is now correct locally by doing the following:

  1. deploy ksqldb, c3 and confluent platform
  2. set up a socat relay between the frontend and ksqldb
  3. modify frontend to point at the relay instead of ksqldb
  4. open two queries in separate tabs
  5. kill socat to simulate a network split

Prior to the change one of the queries would always stay when running show queries from another client. After the change it does not. Body of commit 12fb072 has a decent explanation why this might be the case

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@swist swist requested a review from a team as a code owner January 26, 2021 21:35
@ghost
Copy link

ghost commented Jan 26, 2021

@confluentinc It looks like @swist just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@swist swist requested a review from vcrfxia January 26, 2021 21:36
@swist swist force-pushed the KSQL-5716/close-transient-queries-when-network-splits branch 4 times, most recently from 12fb072 to bc91031 Compare January 27, 2021 09:03
@colinhicks
Copy link
Contributor

@swist is it feasible to add test coverage in io.confluent.ksql.rest.server.resources.WSQueryEndpointTest? Something like calling executeStreamQuery twice, attaching a test close handler to each. Then, close the second websocket before the first, and assert that both close handlers were called?

@swist swist force-pushed the KSQL-5716/close-transient-queries-when-network-splits branch from bc91031 to 7a68084 Compare January 27, 2021 17:15
Comment on lines 279 to 283
final String msg = String.format("Websocket %s closed, reason: %s, code: %s)",
websocket.textHandlerID(),
websocket.closeReason(),
websocket.closeStatusCode());
log.debug(msg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend using the log.debug parameters to pass variables to the string. Like:

log.debug("Websocket {} closed, reason: {},  code: {})", 
        websocket.textHandlerID(), 
        websocket.closeReason(), 
        websocket.closeStatusCode());

Passing the variables to the LOG instead of using the String.format() has the benefit that if DEBUG is disabled (in most cases), then the string will not be formatted. However, if using String.format(), then the string is always formatted even if the log doesn't print it because debug is disabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, was looking for something like this!

private void attachCloseHandler(final ServerWebSocket websocket,
final WebSocketSubscriber<?> subscriber) {
websocket.closeHandler(v -> {
subscriber.close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can subscriber be null? I see that validation in the code you removed. Should it be checked before creating the closeHandler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't really a path that makes it null, but it makes sense to be more defensive there, I'll bring back the check!

Copy link
Contributor

@vcrfxia vcrfxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @swist ! LGTM pending the comments Colin and Sergio have left.

This is actually a pretty scary bug. WSQueryEndpoint is a singleton, but each call to executeStreamQuery handles a new websocket connection. What ended up happening is whenever we handled a query, we updated the subscription on the singleton. As a result whenever a websocket closed, we only closed the query that was currently attached to WSQueryEndpoint and lexically closed in the lambda passed into ServerWebsocket.handleClose
@swist swist force-pushed the KSQL-5716/close-transient-queries-when-network-splits branch from 7a68084 to 3ea65b6 Compare January 27, 2021 21:13
Copy link
Member

@spena spena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@swist
Copy link
Member Author

swist commented Jan 28, 2021

I'm going to merge this without the test for the private method behaviour, but will have a look at refactoring the websocket code a bit later

@swist swist merged commit 98c7d73 into confluentinc:master Jan 28, 2021
@swist swist deleted the KSQL-5716/close-transient-queries-when-network-splits branch January 28, 2021 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants