Source Postgres : Use fast select to get table size estimate in Postgres #21499

akashkulk · 2023-01-17T21:22:20Z

While calculating the estimated number of rows to be synced, the Source Postgres connector currently uses a count(*) query. However, this is not optimal because:

This operation locks the table, so it cannot be performed while reading the data
This operation can result in a full table scan (slow)

There is a faster way to get a very rough estimate :
select reltuples::int8 as count from pg_class c JOIN pg_catalog.pg_namespace n ON n.oid=c.relnamespace where nspname='20m_users' AND relname='users';

However, this estimate can be wildly incorrect for smaller tables (I have seen it return -1). We'd like to try the fast query first and fall back to the slow count if the result is negative.
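A minimal sketch of the try-fast-then-fall-back idea, in Python for brevity (not the connector's actual code). The names `estimate_row_count`, `run_scalar`, and `quote_ident` are hypothetical; `run_scalar` stands in for whatever executes a query and returns its single value, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Callable, Optional, Sequence

# Fast estimate read from catalog statistics, as in the query above.
# The %s placeholders assume a psycopg-style driver (an assumption).
FAST_ESTIMATE_SQL = (
    "SELECT reltuples::int8 AS count "
    "FROM pg_class c "
    "JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace "
    "WHERE nspname = %s AND relname = %s"
)


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def estimate_row_count(
    run_scalar: Callable[[str, Sequence[str]], Optional[int]],
    schema: str,
    table: str,
) -> int:
    """Return the fast estimate, or an exact count(*) if the estimate is unusable.

    run_scalar is a hypothetical helper: it runs one query with parameters
    and returns the single value from the first row (or None if no row).
    """
    estimate = run_scalar(FAST_ESTIMATE_SQL, (schema, table))
    if estimate is not None and estimate >= 0:
        return int(estimate)
    # Negative (e.g. -1) or missing estimate: fall back to the slow, exact count.
    exact_sql = f"SELECT count(*) FROM {quote_ident(schema)}.{quote_ident(table)}"
    return int(run_scalar(exact_sql, ()))
```

With this shape the slow count only runs when the fast query returns a negative value or no row at all, so large tables with usable statistics never pay for the full scan.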

See discussion : #20783


bleonard · 2023-01-18T17:05:01Z

It seems like this would not give the right number for incremental syncs, because we aren't fetching the full table. Is that right, or am I missing something?

akashkulk · 2023-01-20T22:53:00Z

No, the above query is just an estimation. So, for extremely small tables this query returns invalid results (e.g. -1 rows). We still need to determine the total number of rows in the table in incremental mode, since the query we use to calculate table size pg_relation_size(table_name) returns the size of the entire table. Then, we can scale the amount of data (bytes) to be synced based on the percentage of the table we are actually syncing.
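To illustrate the scaling step, here is a rough Python sketch (not the connector's code). The function name `estimate_sync_bytes` and its parameters are hypothetical: `table_bytes` is assumed to be the `pg_relation_size` result, `total_rows` the row count for the whole table, and `rows_to_sync` the number of rows the sync will actually read.

```python
def estimate_sync_bytes(table_bytes: int, total_rows: int, rows_to_sync: int) -> int:
    """Scale the full-table size by the fraction of rows being synced.

    All names here are illustrative. Integer arithmetic is used so the
    result does not depend on floating-point rounding.
    """
    if table_bytes <= 0 or total_rows <= 0 or rows_to_sync <= 0:
        return 0
    # Never report more than the whole table, even if the row estimate is stale.
    rows = min(rows_to_sync, total_rows)
    return table_bytes * rows // total_rows
```

For example, syncing 50 of 200 rows from a 1,000,000-byte table yields an estimate of 250,000 bytes.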

The logic that emits estimate trace messages in incremental mode does calculate the right number of rows and bytes. However, it does this by issuing a select count(*) where cursor_field > cursor_value query. The above fast query cannot be used to estimate incremental rows, so this optimization only applies to full refresh.

akashkulk added the type/enhancement, needs-triage, and team/db-dw-sources labels Jan 17, 2023

akashkulk mentioned this issue Jan 17, 2023

Source Postgres : Emit estimate trace messages for non-CDC mode #20783

Merged

akashkulk mentioned this issue Jan 20, 2023

Source Postgres : Fast query for estimate messages #21683

Merged

akashkulk closed this as completed Feb 1, 2023
