Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source Postgres : Use fast select to get table size estimate in Postgres #21499

Closed
akashkulk opened this issue Jan 17, 2023 · 2 comments
Closed
Labels
needs-triage team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/enhancement New feature or request

Comments

@akashkulk
Copy link
Contributor

While calculating the estimated rows to be synced for Source Postgres connector the count() operation is currently used. However, this is not optimal as :

  • This operation locks the table, so it cannot be performed while reading the data
  • This operation can result in a full table scan (slow)

There is a faster way to get a very rough estimate :
select reltuples::int8 as count from pg_class c JOIN pg_catalog.pg_namespace n ON n.oid=c.relnamespace where nspname='20m_users' AND relname='users';

However, the issue is that this is wildly incorrect (have seen it return -1) for smaller tables. We'd like to try the fast query and fall back to the slow count if it is negative.

See discussion : #20783

@akashkulk akashkulk added type/enhancement New feature or request needs-triage team/db-dw-sources Backlog for Database and Data Warehouse Sources team labels Jan 17, 2023
@bleonard
Copy link
Contributor

It seems like this not get the right number when incremental because we aren't fetching to full table. is that right or am I missing something?

@akashkulk
Copy link
Contributor Author

akashkulk commented Jan 20, 2023

No, the above query is just an estimation. So, for extremely small tables this query returns invalid results (e.g. -1 rows). We still need to determine the total number of rows in the table in incremental mode, since the query we use to calculate table size pg_relation_size(table_name) returns the size of the entire table. Then, we can scale the amount of data (bytes) to be synced based on the percentage of the table we are actually syncing.

The logic while emitting estimate trace messages in incremental mode does calculate the right number of rows and bytes. However, it does this by issuing a.
select count(*) where cursor_field > cursor_value query. The above fast query to estimate incremental rows, so this optimization only applies for full refresh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants