Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a QueryStatementBatcher #173

Merged
merged 3 commits into from
Aug 14, 2023
Merged

Add a QueryStatementBatcher #173

merged 3 commits into from
Aug 14, 2023

Conversation

ePaul
Copy link
Member

@ePaul ePaul commented Jul 28, 2023

This allows to fake batches for statements returning data. Not yet used in the rest of the code, that will be added in later commits.

(This came up as part of #170.)

JDBC distinguishes between "queries" (which can return results) and "updates" (which only can return a number of affected rows). Only for updates you can use the batching feature of JDBC (i.e. sending multiple statements together to the server to have them executed together, saving the round trips to the DB and back). This is then also reflected in Spring's JdbcTemplate (and NamedParameterJdbcTemplate).

As we want to return the auto-generated ID from the inserted rows, the "batching" feature won't work for us.
Searching in the WWW, I found Batching Select Statements in JDBC, which suggests this approach:

  • Have several prepared statements for different input batch sizes
  • partition your data to use the biggest ones first, then the smaller ones for the remainder
  • put the results together.

This class implements this concept in a generic way, based on Spring's NamedParameterJdbcTemplate.
The user provides a template for the SQL string (prefix, repeated part, suffix) with parameter name placeholders, and this class will take case of expanding the template into the several batch sizes of statements, and then using it to actually query/insert/return data in bulk.

(Mentioned in this Stack Overflow Answer.)

This allows to fake batches for statements returning data.
Not yet used in the rest of the code, that will be added in later commits.
@ePaul ePaul added spring-boot-2 Issues/PRs which only apply to the Spring-Boot 2 versions (Releases 20.*+) persistence everything around DB access labels Jul 28, 2023
@ePaul ePaul requested a review from fbrns as a code owner July 28, 2023 13:50
@fbrns
Copy link
Member

fbrns commented Aug 3, 2023

👍

@ePaul
Copy link
Member Author

ePaul commented Aug 7, 2023

(I've not yet merged this, because I'm thinking of doing a benchmark.)

Results from running on my computer are in a PR comment.
@ePaul
Copy link
Member Author

ePaul commented Aug 9, 2023

I did run a very simple benchmark (as a Unit test running against a local DB, for tiny data sets (inserting a row with two integers, returning an ID). Code is in QueryStatementBatcherIT (but disabled for normal builds, to not slow the build down).

The difference is likely less extreme for larger data, but more extreme when the database is further away (network-wise).

Method inserted item count time
QueryStatementBatcher (51, 13 ,4, 1) 5000 0.57s
QueryStatementBatcher (1100, 200, 51, 13, 4, 1) 5000 0.16s
separate queries, serially 5000 22.08s
separate queries, parallelStream 5000 3.2s
batchUpdate (without returning results) 5000 0.10s

The last two don't really work in our case – parallelStream is inserting the items in an arbitrary order, and batchUpdate (which is what we were using before) won't return the generated IDs.

Though this indicates that we generally only should use the batcher when we actually need the IDs, as there is some performance penalty for getting them.

Let me see whether there is an easy way to do this kind of batching where there is nothing to return.

@ePaul
Copy link
Member Author

ePaul commented Aug 9, 2023

Some more investigation showed that JDBC actually (since JDBC 3.0, in Java 1.4) provides a way for fetching the keys generated in update statements.

In Spring JDBC, this is mapped via the KeyHolder interface which can be passed to some of the update methods to JDBCTemplate and NamedParameterJDBCTemplate. Unfortunately, the batchUpdate methods don't support this.

I've found this Stack overflow answer showing a way how to do both, proving that it works (if your DB + JDBC driver supports it – it seems like postgresql does).

I've added a modified version of that answer to the benchmark (not committed) – it takes slightly less time than my QueryStatementBatcher with huge batches (0.15s instead of 0.16s), but still significantly more than the "do not return IDs" version doing a simple batch update.

There is an old issue (originally opened in Jira in 2006, closed there in 2012) for Spring JDBC suggesting to add this to JdbcTemplate + NamedParameterJdbcTemplate, with a relatively new pull request from 2022 actually implementing it (in a cleaner way than the mentioned SO answer). Given that this one is not yet merged, and thus won't arrive in spring JDBC soon, I'll not wait for it.

@ePaul
Copy link
Member Author

ePaul commented Aug 9, 2023

Let me see whether there is an easy way to do this kind of batching where there is nothing to return.

I've also experimented with extending the QueryStatementBatcher to build similar "batched" statements for pure inserts (without returning results) – it is a bit faster than the batcher when returning results, but still slower than the plain batcher. The code is in a separate branch which I don't intend merging, just in case someone is interested.

@ePaul
Copy link
Member Author

ePaul commented Aug 9, 2023

👍

1 similar comment
@fbrns
Copy link
Member

fbrns commented Aug 14, 2023

👍

@ePaul ePaul merged commit 765194f into master Aug 14, 2023
@ePaul ePaul deleted the querystatementbatcher branch August 14, 2023 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
persistence everything around DB access spring-boot-2 Issues/PRs which only apply to the Spring-Boot 2 versions (Releases 20.*+)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants