Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

format: add statistics for tables, columns, queries, etc. #685

Closed
lidavidm opened this issue May 17, 2023 · 9 comments · Fixed by #971
Closed

format: add statistics for tables, columns, queries, etc. #685

lidavidm opened this issue May 17, 2023 · 9 comments · Fixed by #971

Comments

@lidavidm
Copy link
Member

More research is needed on what systems typically support.

This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)

Examples:

  • Estimated result set size for queries (basically, a semi-standardized EXPLAIN (ANALYZE))
  • Parquet-style min/max/null % statistics per column
@lidavidm
Copy link
Member Author

lidavidm commented Jun 12, 2023

Possibly: add ConnectionGetStatistics whose result set schema is something like

- catalog
- db_schema
- table_name
- statistic_type: null percentage, row count, ndv
- column_name: str (null if table-wide statistics)
- value: double

so for a table, you would have a row per column per statistic type, and a row per statistic type (for table-wide statistics: row count only)

(do we want to/how would we account for partitions, in the sense of Flight SQL, etc.?)

@lidavidm

This comment was marked as resolved.

@lidavidm
Copy link
Member Author

lidavidm commented Jun 15, 2023

Apache Hive

https://cwiki.apache.org/confluence/display/Hive/StatsDev
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
https://subscription.packtpub.com/book/data/9781782161080/8/ch08lvl1sec88/column-statistics-in-hive
https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
https://issues.apache.org/jira/browse/HIVE-1362

It's hard to find a definitive reference from just documentation.

  • Provides partition-level statistics (but we don't really expose a concept of partitions, except during query execution time but those would be 'different' partitions)
  • Min/max values for numeric columns
  • Null count (not percentage)
  • NDV
  • Average and maximum column sizes
  • Number of true/false values
  • Histograms ('height-balanced' so the encoding is presumably a list of values dividing the range into equal-sized buckets)
  • Top/bottom K values (proposed, never implemented? HIVE-3421)
  • Thrift API supports typed statistics

What is 'bitVector' statistic? Appears to be the serialization of a NumDistinctValueEstimator which is either a Flajolet-Martin sketch or HyperLogLog. So it appears to be something fairly internal that gets exposed as a statistic.

Poking at a Hive Metastore instance, it seems bitVector, histogram are never set, and string columns don't record min/max.

JDBC

getIndexInfo

  • Oriented around database indices
  • Provides row count and ndv
  • Also provides page count and ordering
  • Differentiates between exact/approximate; this can affect performance of the call

=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.)
=> Is ordering useful?
=> Exact/approximate may also be useful to indicate

ODBC

SQLStatistics

  • Effectively the same as JDBC

Microsoft SQL Server

https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-show-statistics-transact-sql?view=sql-server-ver16

  • Row count
  • "density" which is basically 1/ndv, except it looks deprecated?

PostgreSQL

https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html

  • Row count (may be out of date), page count
  • null percentage
  • a very odd ndv estimate
  • histograms, most common elements, etc.
  • average column width (~= average size of a row in a column; i.e. an estimate of string length, etc?)

=> How should we define ndv?
=> Do we want to be able to map through "most common elements" etc. or should we leave that alone? (Probably leave it alone)
=> We may want to define a column width statistic

Snowflake

https://docs.snowflake.com/en/sql-reference/info-schema/tables

  • Row count
  • Byte size

@lidavidm
Copy link
Member Author

lidavidm commented Jun 15, 2023

So proposal is for:

AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);

- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics

The result set has schema:

- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool
  • column_name is null if the statistic applies to the whole table
  • null_percentage is a value in [0, 1] representing the % of rows in the column that are null
  • ndv is the number of distinct values in the column (I'm tempted to take the PostgreSQL definition: positive means a fixed number of distinct values, negative means a percentage of distinct values. Or, break this into two separate statistic types)
  • row count is a value in [0, inf)
  • byte_width is a value in [0, inf) representing the average size in bytes of a row in the column (e.g. for a string column, this would be the average string size)

unknown values should be null, or the whole row should simply be omitted

Questions:

  • Do we care about min, max, etc.? IMO no, this complicates the encoding of the 'value' in the result, and utility is questionable. (But maybe we do want value to be at least union[double, string] to perhaps allow for this?)
  • Do we encode the statistic names as strings, or requiring dictionary encoding, or specifying an enumeration? (I would prefer dictionary encoding, but this complicates implementation a bit. The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors)

@lidavidm
Copy link
Member Author

Other potential statistics:

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 20, 2023
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 20, 2023
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 20, 2023
@adamkennedy
Copy link

In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.

@lidavidm
Copy link
Member Author

Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns.

This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want min/max of say a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.) but it's doable via a union.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 21, 2023
@lidavidm
Copy link
Member Author

The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).

@lidavidm
Copy link
Member Author

Looking at JDBC drivers:

  • PostgreSQL basically just gives you row count of columns with indices on them (including primary key). The debugger shows it executes this query internally:

    SELECT 
      tmp.TABLE_CAT, 
      tmp.TABLE_SCHEM, 
      tmp.TABLE_NAME, 
      tmp.NON_UNIQUE, 
      tmp.INDEX_QUALIFIER, 
      tmp.INDEX_NAME, 
      tmp.TYPE, 
      tmp.ORDINAL_POSITION, 
      trim(
        both '"' 
        from 
          pg_catalog.pg_get_indexdef(
            tmp.CI_OID, tmp.ORDINAL_POSITION, 
            false
          )
      ) AS COLUMN_NAME, 
      CASE tmp.AM_NAME WHEN 'btree' THEN CASE tmp.I_INDOPTION[tmp.ORDINAL_POSITION - 1] & 1 :: smallint WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, 
      tmp.CARDINALITY, 
      tmp.PAGES, 
      tmp.FILTER_CONDITION 
    FROM 
      (
        SELECT 
          NULL AS TABLE_CAT, 
          n.nspname AS TABLE_SCHEM, 
          ct.relname AS TABLE_NAME, 
          NOT i.indisunique AS NON_UNIQUE, 
          NULL AS INDEX_QUALIFIER, 
          ci.relname AS INDEX_NAME, 
          CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, 
          (
            information_schema._pg_expandarray(i.indkey)
          ).n AS ORDINAL_POSITION, 
          ci.reltuples AS CARDINALITY, 
          ci.relpages AS PAGES, 
          pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION, 
          ci.oid AS CI_OID, 
          i.indoption AS I_INDOPTION, 
          am.amname AS AM_NAME 
        FROM 
          pg_catalog.pg_class ct 
          JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) 
          JOIN pg_catalog.pg_index i ON (ct.oid = i.indrelid) 
          JOIN pg_catalog.pg_class ci ON (ci.oid = i.indexrelid) 
          JOIN pg_catalog.pg_am am ON (ci.relam = am.oid) 
        WHERE 
          true 
          AND ct.relname = 'mytable'
      ) AS tmp 
    ORDER BY 
      NON_UNIQUE, 
      TYPE, 
      INDEX_NAME, 
      ORDINAL_POSITION
  • SQL Server gives you the row count for the table, and tells you about columns with a UNIQUE constraint, but doesn't give you much else. Internally it runs EXEC sp_statistics_100 ?,?,?,?,?, ?

It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to get info about indices, not really get us detailed statistics), so if we want detailed statistics we'll have to do it per database.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 23, 2023
lidavidm added a commit that referenced this issue Jun 27, 2023
lidavidm added a commit that referenced this issue Jun 27, 2023
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 29, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jun 29, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
- New ingest modes (apache#541)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jul 5, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- error_details (apache#755)
- GetStatistics (apache#685)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jul 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
- New ingest modes (apache#541)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jul 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
- New ingest modes (apache#541)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jul 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
- New ingest modes (apache#541)
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Jul 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (apache#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319)
- Get/SetOption
- error_details (apache#755)
- GetStatistics (apache#685)
- New ingest modes (apache#541)
lidavidm added a commit that referenced this issue Jul 11, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Jul 11, 2023
lidavidm added a commit that referenced this issue Jul 20, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Jul 20, 2023
lidavidm added a commit that referenced this issue Jul 20, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Jul 20, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Jul 21, 2023
lidavidm added a commit that referenced this issue Jul 21, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Jul 21, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Jul 21, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Jul 21, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Aug 3, 2023
lidavidm added a commit that referenced this issue Aug 3, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Aug 3, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Aug 10, 2023
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Aug 10, 2023
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Aug 10, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
lidavidm added a commit that referenced this issue Aug 28, 2023
lidavidm added a commit that referenced this issue Aug 28, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- Get/SetOption
- error_details (#755)
- GetStatistics (#685)
- New ingest modes (#541)
lidavidm added a commit that referenced this issue Aug 28, 2023
- ADBC_INFO_DRIVER_ADBC_VERSION
- StatementExecuteSchema (#318)
- ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319)
- error_details (#755)
- GetStatistics (#685)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants