Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: VTGate does not rollback open transactions on shutdown #16840

Open
brendar opened this issue Sep 24, 2024 · 4 comments
Open

Bug Report: VTGate does not rollback open transactions on shutdown #16840

brendar opened this issue Sep 24, 2024 · 4 comments

Comments

@brendar
Copy link
Contributor

brendar commented Sep 24, 2024

Overview of the Issue

On shutdown, vtgate will log messages indicating that it has rolled back open transactions, but they're not being rolled back.

The rollback on shutdown functionality was introduced in #5659 but I don't know if it ever worked. The test accompanying that PR doesn't actually assert transaction rollback, it's just asserting that a new transaction can't see rows written by the other open transaction, which is expected behavior at the default isolation level.

I've updated that test in #16839 which now fails with

Error:      	Received unexpected error:
                target: ks.0.primary: vttablet: Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set. (errno 3572) (sqlstate HY000) (CallerID: userData1): Sql: "select * from buffer where id = :id /* INT64 */ for update nowait", BindVars: {#maxLimit: "type:INT64 value:\"10001\""id: "type:INT64 value:\"3\""} (errno 3572) (sqlstate HY000) during query: select * from buffer where id = 3 for update nowait
Test:       	TestTransactionRollBackWhenShutDown
Messages:   	for query: select * from buffer where id = 3 for update nowait

Reproduction Steps

1. Setup (shell terminal 1)

./101_initial_cluster.sh
mysql -e 'show vitess_tablets' # determine the primary tablet
tail -f vtdataroot/vt_0000000101/vttablet.out # or whichever tablet is the primary

2. Start a transaction (mysql terminal)

use commerce;
begin;
insert into customer (email) values ('foo');

3. Shut down vtgate (shell terminal 2)

../common/scripts/vtgate-down.sh

Results
After 10 seconds the vtgate-down script will exit after the vtgate process terminates

After 30 seconds you'll see a vttablet log like

W0924 13:04:32.225585   85718 tx_pool.go:133] killing transaction (exceeded timeout: 30s): 1727197254734999002	'principal:"brendan" subcomponent:"VTGate MySQL Connector"'	'username:"userData1"'	Sep 24 13:04:02.185697	Jan  1 00:00:00.000000	-9223372036.854776		insert into customer(email) values ('foo');

But the vtgate log will contain a message like

I0924 13:04:16.362320   87658 plugin_mysql_server.go:708] All connections have been rolled back.

Binary Version

Tested on main (`83b37b8e8fbc2353dd511f0e1223271e5879556c`) and v18

Operating System and Environment details

Tested on linux and locally on macos

Log Fragments

No response

@brendar brendar added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Sep 24, 2024
@brendar
Copy link
Contributor Author

brendar commented Sep 30, 2024

It looks like there's a race between servenv.OnClose(srv.rollbackAtShutdown)

servenv.OnClose(srv.rollbackAtShutdown)

and vtg.Gateway().Close(ctx)

servenv.OnClose(func() {
_ = vtg.Gateway().Close(ctx)
})

This causes txc.queryService(ctx, s.TabletAlias) in TxConn.ReleaseAll to get error tablet: cell:"zone1" uid:101 is either down or nonexistent because the healthcheck has been closed.

qs, err := txc.queryService(ctx, s.TabletAlias)

@brendar
Copy link
Contributor Author

brendar commented Sep 30, 2024

As far as fixing this, it seems like rollbackAtShutdown shouldn't be run OnClose since the components it depends on are likely being shut down simultaneously.

Would it make sense to call rollbackAtShutdown at the end of shutdownMysqlProtocolAndDrain (which is called OnTermSync)? It might require setting a limit on how long shutdownMysqlProtocolAndDrain waits for connections to be drained/idle, to leave enough time for rollbacks to complete before onterm_timeout is reached, but it looks plausible.

@brendar
Copy link
Contributor Author

brendar commented Oct 1, 2024

The impact of this bug is likely reduced in v19+ due to #14219
That change ensures that new transactions cannot be started during vtgate shutdown, which reduces the number of transactions left open at shutdown. At that point only transactions which were started before the shutdown period (onterm_timeout which defaults to 10s) would still be open.

@deepthi deepthi added Component: Query Serving and removed Needs Triage This issue needs to be correctly labelled and triaged labels Oct 2, 2024
@harshit-gangal
Copy link
Member

As suggested, looks like rollbackAtShutdown be best called inside shutdownMysqlProtocolAndDrain but we would need some wait timeout before proceeding with it.

If #14219 reduces the issue for you, I would hold off any changes in this area for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants