-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a QueryStatementBatcher #173
Conversation
This allows to fake batches for statements returning data. Not yet used in the rest of the code, that will be added in later commits.
...er/src/main/java/org/zalando/nakadiproducer/eventlog/impl/batcher/QueryStatementBatcher.java
Outdated
Show resolved
Hide resolved
c6941de
to
96ad9bc
Compare
👍 |
(I've not yet merged this, because I'm thinking of doing a benchmark.) |
Results from running on my computer are in a PR comment.
I did run a very simple benchmark (as a Unit test running against a local DB, for tiny data sets (inserting a row with two integers, returning an ID). Code is in QueryStatementBatcherIT (but disabled for normal builds, to not slow the build down). The difference is likely less extreme for larger data, but more extreme when the database is further away (network-wise).
The last two don't really work in our case – parallelStream is inserting the items in an arbitrary order, and batchUpdate (which is what we were using before) won't return the generated IDs. Though this indicates that we generally only should use the batcher when we actually need the IDs, as there is some performance penalty for getting them. Let me see whether there is an easy way to do this kind of batching where there is nothing to return. |
Some more investigation showed that JDBC actually (since JDBC 3.0, in Java 1.4) provides a way for fetching the keys generated in update statements. In Spring JDBC, this is mapped via the KeyHolder interface which can be passed to some of the I've found this Stack overflow answer showing a way how to do both, proving that it works (if your DB + JDBC driver supports it – it seems like postgresql does). I've added a modified version of that answer to the benchmark (not committed) – it takes slightly less time than my QueryStatementBatcher with huge batches (0.15s instead of 0.16s), but still significantly more than the "do not return IDs" version doing a simple batch update. There is an old issue (originally opened in Jira in 2006, closed there in 2012) for Spring JDBC suggesting to add this to JdbcTemplate + NamedParameterJdbcTemplate, with a relatively new pull request from 2022 actually implementing it (in a cleaner way than the mentioned SO answer). Given that this one is not yet merged, and thus won't arrive in spring JDBC soon, I'll not wait for it. |
I've also experimented with extending the QueryStatementBatcher to build similar "batched" statements for pure inserts (without returning results) – it is a bit faster than the batcher when returning results, but still slower than the plain batcher. The code is in a separate branch which I don't intend merging, just in case someone is interested. |
👍 |
1 similar comment
👍 |
This allows to fake batches for statements returning data. Not yet used in the rest of the code, that will be added in later commits.
(This came up as part of #170.)
JDBC distinguishes between "queries" (which can return results) and "updates" (which only can return a number of affected rows). Only for updates you can use the batching feature of JDBC (i.e. sending multiple statements together to the server to have them executed together, saving the round trips to the DB and back). This is then also reflected in Spring's JdbcTemplate (and NamedParameterJdbcTemplate).
As we want to return the auto-generated ID from the inserted rows, the "batching" feature won't work for us.
Searching in the WWW, I found Batching Select Statements in JDBC, which suggests this approach:
This class implements this concept in a generic way, based on Spring's NamedParameterJdbcTemplate.
The user provides a template for the SQL string (prefix, repeated part, suffix) with parameter name placeholders, and this class will take case of expanding the template into the several batch sizes of statements, and then using it to actually query/insert/return data in bulk.
(Mentioned in this Stack Overflow Answer.)