Remove COM_MULTI batching #946

Closed
bgrainger opened this issue Feb 14, 2021 · 8 comments
@bgrainger
Member

Batching was added in #650.

An efficient implementation (using COM_MULTI) is only supported by MariaDB server, and that support is being removed in 10.6.0.

The implementation complicates the client code and will soon serve little purpose (as users upgrade away from the MariaDB server versions that support it). It also has some bugs that are only exposed on certain code paths: #937.

The open question is whether the current MySqlBatch code should be retained (it essentially string-concatenates statements behind the scenes), or removed (for simplicity), with batching only exposed via the generic ADO.NET batching support in .NET 6. CC @roji.
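For illustration, this is roughly the caller-facing shape of MySqlBatch today (a minimal sketch; the connection string and table are placeholders):

```csharp
using System.Threading.Tasks;
using MySqlConnector;

// Minimal sketch of the existing MySqlBatch API; if COM_MULTI support is
// removed, the library could simply join these statements with semicolons
// behind the scenes. Connection string and table are placeholders.
using var connection = new MySqlConnection("Server=localhost;Database=test;UID=root");
await connection.OpenAsync();

using var batch = new MySqlBatch(connection);
batch.BatchCommands.Add(new MySqlBatchCommand("INSERT INTO widgets(name) VALUES('a')"));
batch.BatchCommands.Add(new MySqlBatchCommand("INSERT INTO widgets(name) VALUES('b')"));
await batch.ExecuteNonQueryAsync();
```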

@roji

roji commented Feb 18, 2021

@bgrainger that's sad to hear; I was hoping the new ADO batching API would be beneficial in the MySQL case too...

@vaintroub
Contributor

vaintroub commented May 5, 2021

The new ADO batching API is not entirely new.
JDBC's Statement.addBatch/Statement.executeBatch have efficient implementations in some MySQL drivers, and probably the most sophisticated is in MariaDB's JDBC driver, whose strategies range from simple semicolon batching, to protocol pipelining, to converting multiple INSERT statements into a single multi-row INSERT, to using the COM_BULK MariaDB protocol command.

COM_MULTI was a misunderstanding; basically the same thing can be achieved with protocol pipelining, entirely in the client, by sending multiple commands and then reading multiple responses, without changing anything in the server.
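To make the idea concrete, a conceptual sketch (the two helpers are hypothetical stand-ins for a driver's protocol layer, not real MySqlConnector APIs):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class PipeliningSketch
{
    // Send every command before reading any response, then read the
    // responses back in the same order the commands were written.
    public static async Task ExecuteAsync(Stream stream, IReadOnlyList<string> commands)
    {
        foreach (var command in commands)
            await WriteCommandAsync(stream, command);
        for (var i = 0; i < commands.Count; i++)
            await ReadResponseAsync(stream);
    }

    // Hypothetical protocol-layer helpers (packet framing elided).
    static Task WriteCommandAsync(Stream stream, string command) => throw new NotImplementedException();
    static Task ReadResponseAsync(Stream stream) => throw new NotImplementedException();
}
```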

@bgrainger
Member Author

> protocol pipelining, entirely in the client

If the server supports it. Aurora doesn't (#486 (comment)), but there's no server capability flag (AFAIK) that indicates whether it would be safe or not. With the wide variety of back-ends out there that speak some variation of the MySQL text protocol, it would be risky to assume it's supported. It could be an opt-in behaviour with a connection string option, of course.

(This is just a quick reply; I need to spend some more time thinking about if/how MySqlBatch could be implemented with a pipelined approach, for servers that do support it.)

@vaintroub
Contributor

vaintroub commented May 6, 2021

Of course there is no capability flag that would say "my server will read, and silently drop, client data". What Aurora implements is called a bug. What MariaDB JDBC implements is pipelining by default, with a flag to disable it, for Aurora only.

@roji

roji commented May 6, 2021

I know nothing about MySQL/MariaDB/Aurora... But the first time Npgsql connects to a server (host/port combination), it sends out a special query to learn certain things about the server, which may affect later behavior. You could do something similar to detect the various MySQL variants, and then assume e.g. that Aurora doesn't support batching.

Another way to approach this would be for MySqlBatch to simply implement the pipelining behavior; using that API would mean you're opting into pipelining. Semicolon batching could still be supported inside a single MySqlCommand, or you could have some sort of flag on MySqlBatch itself to control this or whatever.

One small note about pipelining vs. batching - the two are indeed very similar but not exactly identical... For example, in PG, failure in an earlier batched command causes skipping later ones. Similarly, the batched commands run in an implicit transaction (unless an explicit transaction is already in progress). All this isn't mandatory or anything - just discussing other aspects of the feature.

If we're really discussing pure pipelining, i.e. pushing more commands into the pipe before earlier ones have completed, then that could also be implemented simply by allowing MySqlConnector to call DbCommand.ExecuteReader before a previous reader has completed. MySqlConnector would have to internally track ordering of commands (via a queue), which may be a bit more complex than a simple batching implementation; but it would also be slightly more powerful, since users don't have to gather all outgoing commands before sending them all out (i.e. they can be sent out gradually).
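In sketch form, the internal bookkeeping that implies could be a FIFO of in-flight commands (a rough sketch, not MySqlConnector's actual internals):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Rough sketch: each pipelined command enqueues a completion source, and
// responses are matched to commands strictly in FIFO order as they arrive
// off the wire.
class InFlightCommandQueue
{
    private readonly ConcurrentQueue<TaskCompletionSource<object>> _pending = new();

    public Task<object> EnqueueCommand()
    {
        var tcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending.Enqueue(tcs);
        return tcs.Task; // completes when this command's response arrives
    }

    public void OnResponseReceived(object result)
    {
        // Responses arrive in the same order the commands were sent.
        if (_pending.TryDequeue(out var tcs))
            tcs.SetResult(result);
    }
}
```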

@vaintroub
Contributor

vaintroub commented May 6, 2021

Alright, the MySQL protocol is such that the server already sends some information when the client connects - the version string, capability flags, and such. From what I found on the internet, Aurora might be detectable via the version string, too (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Updates.Versions.html#AuroraMySQL.Updates.EngineVersions). I did not check it myself; I have no AWS access right now.
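I haven't verified this either, but if the handshake version string turns out not to be distinctive enough, a one-off probe along these lines could work (assuming, per the AWS docs, an Aurora-specific aurora_version system variable that plain MySQL/MariaDB servers reject):

```csharp
using System.Threading.Tasks;
using MySqlConnector;

static class AuroraDetection
{
    // One-time probe, loosely modeled on Npgsql's startup query. Assumption:
    // Aurora exposes @@aurora_version (per AWS docs), while plain MySQL and
    // MariaDB raise an "unknown system variable" error for it.
    public static async Task<bool> IsAuroraAsync(MySqlConnection connection)
    {
        using var command = new MySqlCommand("SELECT @@aurora_version;", connection);
        try
        {
            return await command.ExecuteScalarAsync() is not null;
        }
        catch (MySqlException)
        {
            return false; // unknown system variable => not Aurora
        }
    }
}
```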

There is no implicit transaction for the semicolon batch, but yes, an error would stop batch execution, just like for PG.
In general, pipelining does not provide benefits as big as "smart" query rewriting: a prepared statement INSERT INTO t VALUES(?,?,?), batched with thousands of rows, would in JDBC result in a single multi-row insert, which is orders of magnitude faster than inserting rows one by one, with or without pipelining. This is known as query rewriting.
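To illustrate the rewrite, a simplified sketch (real drivers must also respect all literal types, quoting rules, and max_allowed_packet):

```csharp
using System.Collections.Generic;
using System.Linq;

static class RewriteSketch
{
    // Turns N bound parameter sets for "INSERT INTO t VALUES(?,?,?)" into
    // one multi-row INSERT. Escaping here is deliberately naive.
    public static string BuildMultiRowInsert(IReadOnlyList<object?[]> rows)
    {
        var valueLists = rows.Select(row =>
            "(" + string.Join(",", row.Select(FormatLiteral)) + ")");
        return "INSERT INTO t VALUES " + string.Join(",", valueLists);
    }

    static string FormatLiteral(object? value) => value switch
    {
        null => "NULL",
        string s => "'" + s.Replace("\\", "\\\\").Replace("'", "\\'") + "'",
        _ => value.ToString()!,
    };
}
```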

One more pipelining peculiarity is that if there are large-ish result sets, the client would still need to read them in parallel somehow, before it has finished sending the full batch. Otherwise, if the client only sends and does not read results, the TCP buffers will fill up, the server will get stuck unable to send more data to the client, and then the client will get stuck unable to send more data to the server. Perhaps, if the client detects there is anything on the socket while it is still writing the batch, it could stop writing for a while and start reading results.
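That last idea could look roughly like this (a sketch; the helpers are hypothetical, and NetworkStream.DataAvailable serves as the "anything on the socket" check):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading.Tasks;

static class DrainWhileWritingSketch
{
    // Between writes, check whether the server has already started sending
    // results; if so, drain some of them before continuing, so neither
    // side's TCP buffer fills up and stalls the other.
    public static async Task WriteBatchAsync(NetworkStream stream, IReadOnlyList<string> commands)
    {
        foreach (var command in commands)
        {
            await WriteCommandAsync(stream, command);
            while (stream.DataAvailable)
                await ReadSomeResultsAsync(stream);
        }
    }

    // Hypothetical protocol-layer helpers.
    static Task WriteCommandAsync(NetworkStream stream, string command) => throw new NotImplementedException();
    static Task ReadSomeResultsAsync(NetworkStream stream) => throw new NotImplementedException();
}
```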

@roji

roji commented May 6, 2021

> There is no implicit transaction for the semicolon batch, but yes, an error would stop batch execution, just like for PG.

Right, but there's nothing like this with "pipeline batching", right? The PG protocol has a special Sync message whose meaning is "when there's an error, discard all messages up to this one". This provides the skip-on-error behavior.

> In general, pipelining does not provide benefits as big as "smart" query rewriting: a prepared statement INSERT INTO t VALUES(?,?,?), batched with thousands of rows, would in JDBC result in a single multi-row insert, which is orders of magnitude faster than inserting rows one by one, with or without pipelining. This is known as query rewriting.

FWIW I'm personally against this form of optimization - the driver should not be in the business of optimizing the user's SQL; that's the user's responsibility (why not have the driver analyze and change your SQL to do more efficient JOINs too...). Parsing SQL at the driver also has an overhead, which slows everything down. AFAIK this kind of rewriting is mostly relevant for bulk insert (as in your example); but if you're doing this and really care about perf, a separate bulk insert protocol would probably be more appropriate anyway (assuming MySQL has one).

> One more pipelining peculiarity is that if there are large-ish result sets, the client would still need to read them in parallel somehow,

The exact same deadlock situation exists in PostgreSQL - you cannot first finish writing, and only then allow the user to start reading. The Npgsql details are a bit complex, but the tl;dr is that writing of a batch is done asynchronously and in parallel, in the background as the user is reading results from earlier queries in the batch.
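In sketch form, the pattern is essentially this (hypothetical helpers; not Npgsql's actual internals):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class BackgroundWriteSketch
{
    // Start the batch write as a background task, consume results
    // concurrently, and only observe the write's outcome at the end.
    public static async Task ExecuteBatchAsync(Stream stream, IReadOnlyList<string> commands)
    {
        var writeTask = WriteAllCommandsAsync(stream, commands);

        for (var i = 0; i < commands.Count; i++)
            await ReadResultAsync(stream); // reading frees the server to keep sending

        await writeTask; // surface any write-side failure
    }

    // Hypothetical protocol-layer helpers.
    static Task WriteAllCommandsAsync(Stream stream, IReadOnlyList<string> commands) => throw new NotImplementedException();
    static Task ReadResultAsync(Stream stream) => throw new NotImplementedException();
}
```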

@vaintroub
Contributor

vaintroub commented May 6, 2021

We don't have a special batching marker; batching worked just out of the box. But yes, "batch start" and "batch end" commands would be interesting.

FWIW, I'm all for "whatever works best" with regard to optimization. For example, prepared statements in MySQL, and by extension in MariaDB, are suboptimally implemented and are not much more than a memory hog on the server (a parsed string, more or less, once per connection; the optimizer does not reuse plans even for the same connection). Thus, most connectors implement client-side prepare, i.e. replacing question marks with actual, correctly escaped data. This also turns out to be helpful when building multi-insert strings for batching.
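Client-side prepare, in its naive sketch form (this version ignores '?' inside string literals and comments, which a real implementation must handle):

```csharp
using System.Text;

static class ClientSidePrepareSketch
{
    // Substitute each '?' placeholder with a correctly escaped literal.
    // Deliberately naive: does not skip '?' inside strings or comments.
    public static string BindParameters(string sql, object?[] values)
    {
        var result = new StringBuilder();
        var parameterIndex = 0;
        foreach (var ch in sql)
            result.Append(ch == '?' ? EscapeLiteral(values[parameterIndex++]) : ch.ToString());
        return result.ToString();
    }

    static string EscapeLiteral(object? value) => value switch
    {
        null => "NULL",
        string s => "'" + s.Replace("\\", "\\\\").Replace("'", "\\'") + "'",
        _ => value.ToString()!,
    };
}
```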
Having said that, MariaDB (but not MySQL) also has a COM_BULK protocol command for prepared statements, where you can bind multiple rows for a single server-side prepared statement. This is not really suitable for terabytes of data, since the server will load the full payload into memory, but it spares some logic on the client, and works with UPDATE too.
