format: add statistics for tables, columns, queries, etc. #685
Comments
Possibly: add
So for a table, you would have one row per column per statistic type, plus one row per statistic type for table-wide statistics (row count only). Open question: do we want to account for partitions (in the sense of Flight SQL, etc.), and if so, how?
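As a rough illustration of that layout, here is a minimal Python sketch of a "long" result set: one row per column per statistic type, plus a table-wide row for the row count. The class name `StatisticRow`, the helper `statistics_rows`, and the statistic names (`row_count`, `distinct_count`, `null_count`) are hypothetical and only meant to show the shape, not a proposed schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class StatisticRow:
    table_name: str
    column_name: Optional[str]  # None marks a table-wide statistic
    statistic_type: str         # hypothetical names, e.g. "row_count"
    value: Union[int, float]


def statistics_rows(
    table_name: str,
    row_count: int,
    column_stats: Dict[str, Dict[str, Union[int, float]]],
) -> List[StatisticRow]:
    """Flatten per-table and per-column statistics into long-format rows."""
    # Table-wide statistics come first: only the row count in this sketch.
    rows = [StatisticRow(table_name, None, "row_count", row_count)]
    # Then one row per (column, statistic type) pair.
    for column, stats in column_stats.items():
        for stat_type, value in stats.items():
            rows.append(StatisticRow(table_name, column, stat_type, value))
    return rows


rows = statistics_rows(
    "orders",
    1000,
    {"id": {"distinct_count": 1000, "null_count": 0}, "price": {"null_count": 12}},
)
```

A statistic the backend does not know is simply left out of `column_stats`, so no row is produced for it.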
Apache Hive
https://cwiki.apache.org/confluence/display/Hive/StatsDev
It's hard to find a definitive reference from just documentation.
What is the 'bitVector' statistic? It appears to be the serialization of a NumDistinctValueEstimator, which is either a Flajolet-Martin sketch or HyperLogLog, so it seems to be something fairly internal that gets exposed as a statistic. Poking at a Hive Metastore instance, it seems bitVector and histogram are never set, and string columns don't record min/max.
JDBC
=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.)
ODBC
Microsoft SQL Server
PostgreSQL
https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html
=> How should we define ndv?
Snowflake
https://docs.snowflake.com/en/sql-reference/info-schema/tables
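On the ndv question raised for PostgreSQL above: the `n_distinct` column of `pg_stats` is a positive estimate of the distinct count when greater than zero, and the negative of (distinct values / row count) when less than zero, so a driver would need the row count to report an absolute number. A minimal sketch of that normalization follows; the function name `absolute_ndv` is hypothetical, and treating zero as "unknown" is an assumption based on how `pg_statistic.stadistinct` is documented.

```python
from typing import Optional


def absolute_ndv(n_distinct: float, row_count: float) -> Optional[float]:
    """Convert a PostgreSQL-style n_distinct into an absolute distinct count."""
    if n_distinct > 0:
        # Already an estimated number of distinct values.
        return n_distinct
    if n_distinct < 0:
        # Negative values are a fraction of the row count (-1 means unique).
        return -n_distinct * row_count
    # Assumption: zero means the distinct count is unknown.
    return None
```

Returning `None` for the unknown case lines up with reporting a null value or omitting the row.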
So proposal is for:
The result set has schema:
Unknown values should be null, or the whole row should simply be omitted.
Questions:
Other potential statistics:
Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755
In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.
Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns. This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want the min/max of, say, a string column) and list types (for min/max N, histograms, etc.) is a bit of a pain in Arrow, but it's doable via a union.
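To make the union idea concrete, here is a pure-Python sketch of how a dense union lays out mixed-type values: a type id per slot, an offset per slot, and one child array per type. This only mimics the Arrow dense union layout with plain lists; the child names and type codes are hypothetical, nulls are not handled, and a real driver would build the array with an Arrow library instead.

```python
from typing import Dict, List, Tuple, Union

Value = Union[int, float, str]

# Hypothetical child types and their type codes for this sketch.
CHILD_NAMES = ["int64", "float64", "string"]


def encode_dense_union(
    values: List[Value],
) -> Tuple[List[int], List[int], Dict[str, List[Value]]]:
    """Split mixed-type values into type ids, offsets, and child arrays."""
    children: Dict[str, List[Value]] = {name: [] for name in CHILD_NAMES}
    type_ids: List[int] = []
    offsets: List[int] = []
    for v in values:
        if isinstance(v, bool):
            raise TypeError("bool is not supported in this sketch")
        if isinstance(v, int):
            name = "int64"
        elif isinstance(v, float):
            name = "float64"
        elif isinstance(v, str):
            name = "string"
        else:
            raise TypeError(f"unsupported value type: {type(v).__name__}")
        type_ids.append(CHILD_NAMES.index(name))
        # The offset is the position of this value within its child array.
        offsets.append(len(children[name]))
        children[name].append(v)
    return type_ids, offsets, children


def decode_dense_union(
    type_ids: List[int], offsets: List[int], children: Dict[str, List[Value]]
) -> List[Value]:
    """Rebuild the original values from the dense-union pieces."""
    return [children[CHILD_NAMES[t]][o] for t, o in zip(type_ids, offsets)]


# A min/max column mixing integer, string, and float statistics.
type_ids, offsets, children = encode_dense_union([1, "a", 2.5, 7, "zz"])
```

Each child array stays homogeneous, which is what lets a single statistic-value column hold, say, an integer row count next to a string minimum.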
The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented, but top/bottom K never were. The proposal allows backends to return custom statistics, so Hive/Postgres could still encode histograms. The encoding with Arrow gets very messy, however, given the lack of an 'any' type: they would have to pack the histogram values into a binary column.
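For a sense of what packing a histogram into a binary value could look like, here is a minimal sketch using Python's `struct` module. The wire format (a little-endian uint32 count followed by float64 bucket bounds) and the function names are made up for illustration; nothing in the proposal defines such an encoding, and each backend would have to document its own.

```python
import struct
from typing import List


def pack_histogram(bounds: List[float]) -> bytes:
    """Pack bucket bounds as: uint32 count, then that many float64 values."""
    # "<" selects little-endian with no alignment padding.
    return struct.pack(f"<I{len(bounds)}d", len(bounds), *bounds)


def unpack_histogram(blob: bytes) -> List[float]:
    """Inverse of pack_histogram."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


blob = pack_histogram([0.0, 10.5, 99.0])
```

The downside noted above still applies: a consumer cannot interpret the bytes without knowing the backend-specific format.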
Looking at JDBC drivers:
It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to get info about indices, not really get us detailed statistics), so if we want detailed statistics we'll have to do it per database.
- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - Get/SetOption - error_details (apache#755) - GetStatistics (apache#685)
- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - Get/SetOption - error_details (apache#755) - GetStatistics (apache#685) - New ingest modes (apache#541)
- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - error_details (apache#755) - GetStatistics (apache#685)
More research is needed on what systems typically support.
This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)
Examples: