🎉 JDBC source: adjust streaming query fetch size dynamically #12400

tuliren · 2022-04-27T14:24:00Z

What

This PR resolves "Adjust JDBC source fetch size dynamically to avoid OOME" (#12192) and "Make batchSize configurable or increase the value to improve performance" (#4314).

How

First, set the fetch size to 10. Fetch those 10 rows, measure their mean serialized size, and use that size to estimate the best fetch size (N).
Second, set the fetch size to N. From then on, sample the serialized size once every 100 rows and adjust the fetch size accordingly.
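
The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class name `TwoStageFetchSizeSketch`, the `accept` method, the running-mean update, and the bounds of 1 and 100,000 are all hypothetical choices made for the example. Only the 200 MB buffer, the initial sample of 10 rows, and the sampling interval of 100 rows come from the discussion.

```java
import java.util.Optional;

/** Hypothetical two-stage estimator; names and structure are illustrative only. */
final class TwoStageFetchSizeSketch {

  static final long BUFFER_BYTE_SIZE = 200L * 1024L * 1024L; // 200 MB, as in the PR
  static final int INITIAL_SAMPLE_SIZE = 10; // stage one: measure the first 10 rows
  static final int SAMPLE_FREQUENCY = 100;   // stage two: measure every 100th row
  static final int MIN_FETCH_SIZE = 1;       // assumed lower bound
  static final int MAX_FETCH_SIZE = 100_000; // assumed upper bound

  private long rowCount = 0;
  private long sampledRows = 0;
  private double meanByteSize = 0.0;
  private int fetchSize = INITIAL_SAMPLE_SIZE;

  /** Records one row and returns a fetch size only when it should change. */
  Optional<Integer> accept(final long rowByteSize) {
    rowCount++;
    final boolean initialStage = rowCount <= INITIAL_SAMPLE_SIZE;
    if (!initialStage && rowCount % SAMPLE_FREQUENCY != 0) {
      return Optional.empty(); // stage two skips the rows between samples
    }
    sampledRows++;
    meanByteSize += (rowByteSize - meanByteSize) / sampledRows; // running mean
    if (rowCount < INITIAL_SAMPLE_SIZE) {
      return Optional.empty(); // wait until the initial sample is complete
    }
    final int next = bounded((long) (BUFFER_BYTE_SIZE / meanByteSize));
    if (next == fetchSize) {
      return Optional.empty(); // nothing to update
    }
    fetchSize = next;
    return Optional.of(next);
  }

  private static int bounded(final long rawFetchSize) {
    return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, rawFetchSize));
  }
}
```

With ten 1 MiB rows, this sketch proposes a fetch size of 200 on the tenth row. A caller would then pass that value to something like `ResultSet.setFetchSize`, though how the real connector applies it is not shown here.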

🚨 User Impact 🚨

This PR fixes an issue where the JDBC source connector fails to fetch database tables with extremely large rows.

tuliren · 2022-04-27T15:22:33Z

/test connector=connectors/source-postgres

🕑 connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2234049050
✅ connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2234049050
No Python unittests run

edgao

There are a few places where I think a comment would be useful, but this looks really good!

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/BaseSizeEstimator.java

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/TwoStageSizeEstimator.java

marcosmarxm · 2022-04-28T13:18:48Z

One question @tuliren: what happens if a user has 2 connections that run in parallel? Is the first one only calculating a limit on total memory usage? Is the batch size recalculated every time, or only when the source throws an OOM?

tuliren · 2022-04-28T17:21:16Z

what happens if a user has 2 connections that run in parallel?
Is the batch size recalculated every time, or only when the source throws an OOM?

The batch size is recalculated per streaming query. Even within one connection, the query for each table can have a different batch size, depending on how large the average row in that table is. So it does not matter how many connections a user has.

Is the first one only calculating a limit on total memory usage?

Not sure what you mean by "the first one". When calculating the batch size, we currently assume that there is a 200 MB buffer size in memory. So the batch size = 200 MB / mean row byte size.
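
As a worked illustration of that formula, here is a small sketch showing how two tables with different mean row sizes end up with different batch sizes. The class and method names are hypothetical, and the result is the raw value before any minimum or maximum bound is applied.

```java
/** Hypothetical helper illustrating batch size = 200 MB / mean row byte size. */
final class FetchSizeFormulaSketch {

  static final long BUFFER_BYTE_SIZE = 200L * 1024L * 1024L; // 200 MB, as in the PR

  /** Returns the unbounded fetch size for a given mean row size in bytes. */
  static long rawFetchSize(final double meanRowByteSize) {
    if (meanRowByteSize <= 0.0) {
      // Assumption for this sketch: reject non-positive sizes instead of guessing.
      throw new IllegalArgumentException("mean row size must be positive");
    }
    return (long) (BUFFER_BYTE_SIZE / meanRowByteSize);
  }
}
```

For a table whose rows average 2 KiB, `rawFetchSize(2048.0)` gives 102400 rows per fetch. For a table whose rows average 50 MiB, `rawFetchSize(50.0 * 1024 * 1024)` gives 4.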

grishick · 2022-04-29T00:08:37Z

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/FetchSizeConstants.java

+
+public final class FetchSizeConstants {
+
+  public static final long BUFFER_BYTE_SIZE = 200L * 1024L * 1024L; // 200 MB


Can we make this configurable, so that if a single row exceeds 200MB we can reconfigure the pod to have more memory and reconfigure the connector to have a larger buffer?

The 200 MB buffer size is not enforced. It is only used to calculate the fetch size. Currently, each connector has much more than 200 MB of heap. The maximum row size the connector can handle is actually limited by the heap size.

I'd prefer not to expose this as a connector parameter. Users should not have to worry about this kind of low-level detail. It would make the setup confusing. For example, we currently let people configure the part size in the blob storage connector. People don't always understand what it means, and sometimes they set a wrong value, resulting in failed connections. We are in the process of removing it.

If a row is larger than 200 MB, the user should store the data in blob storage or something similar. I don't think we need to support such an edge case. No matter how large the buffer is, there can always be some use case that breaks it.

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/AdaptiveStreamingQueryConfig.java

grishick · 2022-04-29T00:15:09Z

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/BaseSizeEstimator.java

+    if (rawFetchSize > Integer.MAX_VALUE) {
+      return maxFetchSize;
+    }
+    return Math.max(minFetchSize, Math.min(maxFetchSize, (int) rawFetchSize));


Can this ever return 0?
What happens when the estimator estimates that even a single row is larger than available buffer?

No, it will always return a value in the range of [minFetchSize, maxFetchSize].

As I mentioned in the above comment, as long as the row can fit in the total heap, the connector can still handle it.
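
To make the bounds concrete, the sketch below wraps the clamping logic from the diff above in a self-contained method. The wrapper class, the method name, and the constant values (a minimum of 1 and a maximum of 100,000) are assumptions for illustration. The three lines of clamping logic mirror the snippet under review.

```java
/** Hypothetical wrapper around the clamping logic shown in the diff. */
final class FetchSizeBoundsSketch {

  static final long BUFFER_BYTE_SIZE = 200L * 1024L * 1024L; // 200 MB, as in the PR
  static final int MIN_FETCH_SIZE = 1;       // assumed value
  static final int MAX_FETCH_SIZE = 100_000; // assumed value

  static int boundedFetchSize(final double meanRowByteSize) {
    // The cast truncates toward zero, so a row larger than the buffer yields 0 here.
    final long rawFetchSize = (long) (BUFFER_BYTE_SIZE / meanRowByteSize);
    if (rawFetchSize > Integer.MAX_VALUE) {
      return MAX_FETCH_SIZE;
    }
    // Math.max lifts a raw value of 0 up to the minimum, so 0 is never returned.
    return Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, (int) rawFetchSize));
  }
}
```

Under these assumptions, a 300 MiB mean row size gives a raw value of 0 and a returned fetch size of 1, a 1 MiB mean gives 200, and a 1-byte mean is capped at 100000.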

tuliren · 2022-04-29T03:55:15Z

/test connector=connectors/source-postgres

🕑 connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2243262779
✅ connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2243262779
No Python unittests run

tuliren · 2022-04-29T03:55:22Z

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/2243263188
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/2243263188
No Python unittests run

tuliren · 2022-04-29T03:55:29Z

/test connector=connectors/source-mssql

🕑 connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2243263525
✅ connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2243263525
No Python unittests run

tuliren · 2022-04-29T03:55:38Z

/test connector=connectors/source-snowflake

🕑 connectors/source-snowflake https://github.com/airbytehq/airbyte/actions/runs/2243263971
❌ connectors/source-snowflake https://github.com/airbytehq/airbyte/actions/runs/2243263971
🐛 https://gradle.com/s/w6pfry7s2g3oe

tuliren · 2022-04-29T04:32:59Z

/test connector=connectors/source-snowflake

🕑 connectors/source-snowflake https://github.com/airbytehq/airbyte/actions/runs/2243372395
✅ connectors/source-snowflake https://github.com/airbytehq/airbyte/actions/runs/2243372395
No Python unittests run

tuliren · 2022-04-29T04:46:20Z

/test connector=connectors/source-cockroachdb

🕑 connectors/source-cockroachdb https://github.com/airbytehq/airbyte/actions/runs/2243409239
✅ connectors/source-cockroachdb https://github.com/airbytehq/airbyte/actions/runs/2243409239
No Python unittests run

tuliren · 2022-04-29T04:46:27Z

/test connector=connectors/source-db2

🕑 connectors/source-db2 https://github.com/airbytehq/airbyte/actions/runs/2243409513
✅ connectors/source-db2 https://github.com/airbytehq/airbyte/actions/runs/2243409513
No Python unittests run

tuliren · 2022-04-29T04:46:38Z

/test connector=connectors/source-oracle

🕑 connectors/source-oracle https://github.com/airbytehq/airbyte/actions/runs/2243409966
✅ connectors/source-oracle https://github.com/airbytehq/airbyte/actions/runs/2243409966
No Python unittests run

tuliren · 2022-04-29T04:46:47Z

/test connector=connectors/source-redshift

🕑 connectors/source-redshift https://github.com/airbytehq/airbyte/actions/runs/2243410375
✅ connectors/source-redshift https://github.com/airbytehq/airbyte/actions/runs/2243410375
No Python unittests run

tuliren · 2022-04-29T04:47:03Z

/test connector=connectors/source-tidb

🕑 connectors/source-tidb https://github.com/airbytehq/airbyte/actions/runs/2243411344
✅ connectors/source-tidb https://github.com/airbytehq/airbyte/actions/runs/2243411344
No Python unittests run

tuliren · 2022-04-29T04:54:27Z

Will publish new connector versions in separate PRs.

SamiRiahy · 2022-05-11T08:52:03Z

Hi @tuliren ,

When you say "adjust streaming query fetch size dynamically", does the fetch size depend on the capacity of the database source?

If I increase the capacity of my database sources (increase the buffer size and fetch size) and give 20GO of RAM to Airbyte (JOB_MAIN_CONTAINER_MEMORY_REQUEST and JOB_MAIN_CONTAINER_MEMORY_LIMIT), will this help me get better performance?

Regardless of performance, will the logs still print every 1000 rows?

marcosmarxm · 2022-05-12T01:39:42Z

Did you update the connector version @SamiRiahy ?

cgardens · 2022-05-12T02:55:57Z

@tuliren I'm late to the party... this is awesome!

tuliren · 2022-05-12T03:12:19Z

Hi @SamiRiahy,

When you say "adjust streaming query fetch size dynamically", does the fetch size depend on the capacity of the database source?

No, the fetch size depends on the average row size. Internally we allocate roughly 200 MB of memory as the buffer. We first measure the average row size, say X MB, and then calculate the fetch size as 200 / X. This way, we can 1) avoid reading too much data in each fetch and prevent out-of-memory errors, and 2) read more rows in each fetch when the average row size is small, to improve performance. Currently the max fetch size is set to 100K.

In reality, I did not see much performance improvement by reading more rows per fetch. To better improve the performance, we need to investigate the bottleneck more closely. Here is the issue that tracks this topic: #12532

If I increase the capacity of my database sources (increase the buffer size and fetch size) and give 20GO of RAM to Airbyte (JOB_MAIN_CONTAINER_MEMORY_REQUEST and JOB_MAIN_CONTAINER_MEMORY_LIMIT), will this help me get better performance?

By "20GO", do you mean 20 GB? This is probably unnecessary, and I think it won't improve the performance much, for two reasons:

Right now we only allocate 200 MB of memory as the buffer for the fetched data, and this size is hard-coded. So the extra memory available to the source connector won't be used for the buffer. The extra memory will only help if each row is extremely large.
Based on my experiments, reading more rows per fetch does not affect the performance much. So even with a larger buffer, we may not see a significant performance boost.

Internally we use 750 MB for the database source connectors. The database connector should be able to work with only 500 MB for most datasets (i.e. datasets that do not have tables with fat rows, such as 100 MB per row). If you have spare resources, giving it 1 GB should be more than enough.

Regardless of performance, will the logs still print every 1000 rows?

Yes. We log the record count for every 1000 rows, and the fetch size for every 100 rows. I think the latter is too frequent. I will reduce the logging frequency.

* Merge all streaming configs to one
* Implement new streaming query config
* Format code
* Fix comparison
* Use double for mean byte size
* Update fetch size only when changed
* Calculate mean size by sampling n rows
* Add javadoc
* Change min fetch size to 1
* Add comment by buffer size
* Update java connector template
* Perform division first
* Add unit test for fetching large rows
* Format code
* Fix connector compilation error

github-actions bot added the area/connectors Connector related issues label Apr 27, 2022

tuliren added 3 commits April 27, 2022 07:29

Merge all streaming configs to one

c15933f

Implement new streaming query config

bbeca63

Format code

d63e0e1

tuliren force-pushed the liren/set-fetch-size-in-streaming-db branch from 9f81e2b to d63e0e1 Compare April 27, 2022 14:29

tuliren temporarily deployed to more-secrets April 27, 2022 14:31 Inactive

Fix comparison

47b9c83

tuliren requested review from edgao and subodh1810 April 27, 2022 15:16

tuliren temporarily deployed to more-secrets April 27, 2022 15:24 Inactive

tuliren mentioned this pull request Apr 27, 2022

Make batchSize configurable or increase the value to improve performance #4314

Closed

tuliren linked an issue Apr 27, 2022 that may be closed by this pull request

Make batchSize configurable or increase the value to improve performance #4314

Closed

edgao approved these changes Apr 27, 2022

tuliren added 3 commits April 28, 2022 15:04

Use double for mean byte size

aa38849

Update fetch size only when changed

dcbeafd

Merge branch 'master' into liren/set-fetch-size-in-streaming-db

1d39acc

tuliren temporarily deployed to more-secrets April 28, 2022 22:54 Inactive

Calculate mean size by sampling n rows

385c6bf

grishick reviewed Apr 29, 2022

airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/streaming/AdaptiveStreamingQueryConfig.java (outdated)

grishick reviewed Apr 29, 2022

tuliren added 3 commits April 28, 2022 17:43

Add javadoc

001e386

Change min fetch size to 1

613d429

Add comment by buffer size

4916f35

tuliren temporarily deployed to more-secrets April 29, 2022 03:31 Inactive

Format code

a6c5b21

tuliren temporarily deployed to more-secrets April 29, 2022 03:55 Inactive

Fix connector compilation error

8e2f85c

tuliren temporarily deployed to more-secrets April 29, 2022 04:34 Inactive

tuliren temporarily deployed to more-secrets April 29, 2022 04:35 Inactive

tuliren merged commit 55a0db7 into master Apr 29, 2022

tuliren deleted the liren/set-fetch-size-in-streaming-db branch April 29, 2022 05:36

tuliren mentioned this pull request Apr 29, 2022

🎉 Jdbc sources: publish new version with adaptive fetch size #12480

Merged

9 tasks

octavia-squidington-iii mentioned this pull request Apr 29, 2022

Bump Airbyte version from 0.36.5-alpha to 0.36.6-alpha #12485

Merged

marcosmarxm mentioned this pull request May 2, 2022

Troubleshooting increase batchSize and add section in database source #4569

Closed

marcosmarxm mentioned this pull request Jul 23, 2022

Investigate the performance bottleneck of source database connectors #12532

Closed

evantahler mentioned this pull request Nov 16, 2022

Increase Database Source SELECT Batch Size #19514

Merged
