Investigation : Improve LIMIT size calculation to increase DB source syncs #20417

Closed
akashkulk opened this issue Dec 13, 2022 · 3 comments
Labels
needs-triage team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/enhancement New feature or request

Comments

@akashkulk
Contributor

Tell us about the problem you're trying to solve

Currently, we optimize our batch sizes/limits based on current row size. As this PR showed, there are some major gains in performance from increasing batch size while executing queries.
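To make the idea of sizing a batch from row size concrete, here is a minimal sketch. It is not the connector's actual code: the class name, the memory budget, and the minimum/maximum bounds are all assumptions chosen for illustration.

```java
final class FetchSizeSketch {
    // Hypothetical bounds, chosen only for this example.
    static final int MIN_FETCH_SIZE = 1;
    static final int MAX_FETCH_SIZE = 1_000_000;

    /**
     * Estimates how many rows fit into a buffer of targetBufferBytes,
     * given an observed mean row size, clamped to [MIN_FETCH_SIZE, MAX_FETCH_SIZE].
     */
    static int estimateFetchSize(long targetBufferBytes, long meanRowBytes) {
        if (targetBufferBytes <= 0 || meanRowBytes <= 0) {
            throw new IllegalArgumentException("buffer and row size must be positive");
        }
        long rows = targetBufferBytes / meanRowBytes;
        // Clamp so a very wide row still fetches at least one record,
        // and a very skinny row does not produce an unbounded batch.
        return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, rows));
    }

    public static void main(String[] args) {
        long budget = 200L * 1024 * 1024; // assumed 200 MiB buffer budget
        System.out.println("skinny table (50 B/row): " + estimateFetchSize(budget, 50));
        System.out.println("wide table (50 kB/row): " + estimateFetchSize(budget, 50_000));
    }
}
```

With a fixed memory budget, a skinny table gets a much larger batch than a wide one, which is the lever this issue proposes to investigate further.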

Describe the solution you’d like

Investigate whether further changes to this strategy can yield additional performance gains without causing out-of-memory (OOM) errors.

@akashkulk akashkulk added type/enhancement New feature or request needs-triage labels Dec 13, 2022
@akashkulk akashkulk added team/db-dw-sources Backlog for Database and Data Warehouse Sources team and removed team/triage labels Dec 13, 2022
@bleonard
Contributor

bleonard commented Dec 13, 2022

The previous PR changed the parameters in the current model fairly conservatively.

We are totally open to changing the model drastically or incrementally if that's what makes sense.

When testing out other data loading tools (BEAM), in a test that @grishick did on a skinny table, we were seeing it query 100x what we are doing. Ours seemed to cap out at 1,000, even on a "skinny" table.

@bleonard bleonard changed the title Investigation : Optimize LIMIT size to increase DB source syncs Investigation : Improve LIMIT size to increase DB source syncs Dec 13, 2022
@bleonard bleonard changed the title Investigation : Improve LIMIT size to increase DB source syncs Investigation : Improve LIMIT size calculation to increase DB source syncs Dec 13, 2022
@bleonard
Contributor

  1. Let's update the logs to say what we are setting as the fetch size.

  2. Let's fix the bug we found about the ratio.

@akashkulk
Contributor Author

One thing to note is that our records don't cap out at 1,000. It's just that we have a progress indicator that prints out every 1,000 records.

You can grep for "fetch size" in the logs to see what the batch size is. Most JDBC drivers respect this parameter, but this is not guaranteed.
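For reference, a minimal sketch of how a fetch size reaches JDBC and how it could be logged. `Statement.setFetchSize` and `Statement.getFetchSize` are standard JDBC calls; the class name, the log message format, and the stand-in `fakeStatement` (used so the example runs without a database) are assumptions for illustration and not the connector's code.

```java
import java.lang.reflect.Proxy;
import java.sql.SQLException;
import java.sql.Statement;

final class FetchSizeHint {
    /** Passes the fetch-size hint to the driver and returns the value the driver reports back. */
    static int applyFetchSize(Statement statement, int requested) throws SQLException {
        statement.setFetchSize(requested); // only a hint; a driver may ignore or adjust it
        int reported = statement.getFetchSize();
        System.out.println("fetch size requested=" + requested + ", reported by driver=" + reported);
        return reported;
    }

    /** Stand-in Statement for the demo: it only remembers the hint, there is no real driver. */
    static Statement fakeStatement() {
        int[] stored = {0};
        return (Statement) Proxy.newProxyInstance(
            FetchSizeHint.class.getClassLoader(),
            new Class<?>[] {Statement.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "setFetchSize":
                        stored[0] = (Integer) args[0];
                        return null;
                    case "getFetchSize":
                        return stored[0];
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
    }

    public static void main(String[] args) throws SQLException {
        applyFetchSize(fakeStatement(), 5000);
    }
}
```

Note that the value read back from `getFetchSize` does not prove the driver actually fetches that many rows per round trip, so measuring memory use and throughput is still needed.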

Some additional notes on the JDBC buffer can be found here: https://docs.google.com/document/d/10yo3oc4kwGAzUpgltpPyprohiYc0fsAMvb85pn-kdHw/edit#


3 participants