format: add statistics for tables, columns, queries, etc. #685

lidavidm · 2023-05-17T13:01:16Z

More research is needed on what systems typically support.

This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)

Examples:

Estimated result set size for queries (basically, a semi-standardized EXPLAIN (ANALYZE))
Parquet-style min/max/null % statistics per column

lidavidm · 2023-06-12T17:28:27Z

Possibly: add ConnectionGetStatistics whose result set schema is something like

- catalog
- db_schema
- table_name
- statistic_type: null percentage, row count, ndv
- column_name: str (null if table-wide statistics)
- value: double

so for a table, you would have a row per column per statistic type, and a row per statistic type (for table-wide statistics: row count only)

(do we want to/how would we account for partitions, in the sense of Flight SQL, etc.?)

lidavidm · 2023-06-15T15:21:20Z

Apache Hive

https://cwiki.apache.org/confluence/display/Hive/StatsDev
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
https://subscription.packtpub.com/book/data/9781782161080/8/ch08lvl1sec88/column-statistics-in-hive
https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
https://issues.apache.org/jira/browse/HIVE-1362

It's hard to find a definitive reference from just documentation.

Provides partition-level statistics (but we don't really expose a concept of partitions, except during query execution time but those would be 'different' partitions)
Min/max values for numeric columns
Null count (not percentage)
NDV
Average and maximum column sizes
Number of true/false values
Histograms ('height-balanced' so the encoding is presumably a list of values dividing the range into equal-sized buckets)
Top/bottom K values (proposed, never implemented? HIVE-3421)
Thrift API supports typed statistics

What is 'bitVector' statistic? Appears to be the serialization of a NumDistinctValueEstimator which is either a Flajolet-Martin sketch or HyperLogLog. So it appears to be something fairly internal that gets exposed as a statistic.

Poking at a Hive Metastore instance, it seems bitVector, histogram are never set, and string columns don't record min/max.

JDBC

getIndexInfo

Oriented around database indices
Provides row count and ndv
Also provides page count and ordering
Differentiates between exact/approximate; this can affect performance of the call

=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.)
=> Is ordering useful?
=> Exact/approximate may also be useful to indicate

ODBC

SQLStatistics

Effectively the same as JDBC

Microsoft SQL Server

https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-show-statistics-transact-sql?view=sql-server-ver16

Row count
"density" which is basically 1/ndv, except it looks deprecated?

PostgreSQL

https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html

Row count (may be out of date), page count
null percentage
a very odd ndv estimate
histograms, most common elements, etc.
average column width (~= average size of a row in a column; i.e. an estimate of string length, etc?)

=> How should we define ndv?
=> Do we want to be able to map through "most common elements" etc. or should we leave that alone? (Probably leave it alone)
=> We may want to define a column width statistic

Snowflake

https://docs.snowflake.com/en/sql-reference/info-schema/tables

Row count
Byte size

lidavidm · 2023-06-15T16:45:06Z

So proposal is for:

AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);

- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics

The result set has schema:

- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool

column_name is null if the statistic applies to the whole table
null_percentage is a value in [0, 1] representing the % of rows in the column that are null
ndv is the number of distinct values in the column (I'm tempted to take the PostgreSQL definition: positive means a fixed number of distinct values, negative means a percentage of distinct values. Or, break this into two separate statistic types)
row count is a value in [0, inf)
byte_width is a value in [0, inf) representing the average size in bytes of a row in the column (e.g. for a string column, this would be the average string size)

unknown values should be null, or the whole row should simply be omitted

Questions:

Do we care about min, max, etc.? IMO no, this complicates the encoding of the 'value' in the result, and utility is questionable. (But maybe we do want value to be at least union[double, string] to perhaps allow for this?)
Do we encode the statistic names as strings, or requiring dictionary encoding, or specifying an enumeration? (I would prefer dictionary encoding, but this complicates implementation a bit. The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors)

lidavidm · 2023-06-15T16:51:43Z

Other potential statistics:

Calcite defines these: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.html

If we were to provide a direct equivalent of getIndexInfo, we could possibly also map ColumnUniqueness, Collation. It may work to allow column_name to be list<str>.

Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755

adamkennedy · 2023-06-20T20:27:33Z

In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.

lidavidm · 2023-06-20T22:07:26Z

Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns.

This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want min/max of say a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.) but it's doable via a union.

Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755

lidavidm · 2023-06-21T13:42:47Z

The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).

lidavidm · 2023-06-21T19:18:43Z

Looking at JDBC drivers:

PostgreSQL basically just gives you row count of columns with indices on them (including primary key). The debugger shows it executes this query internally:

SELECT 
  tmp.TABLE_CAT, 
  tmp.TABLE_SCHEM, 
  tmp.TABLE_NAME, 
  tmp.NON_UNIQUE, 
  tmp.INDEX_QUALIFIER, 
  tmp.INDEX_NAME, 
  tmp.TYPE, 
  tmp.ORDINAL_POSITION, 
  trim(
    both '"' 
    from 
      pg_catalog.pg_get_indexdef(
        tmp.CI_OID, tmp.ORDINAL_POSITION, 
        false
      )
  ) AS COLUMN_NAME, 
  CASE tmp.AM_NAME WHEN 'btree' THEN CASE tmp.I_INDOPTION[tmp.ORDINAL_POSITION - 1] & 1 :: smallint WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, 
  tmp.CARDINALITY, 
  tmp.PAGES, 
  tmp.FILTER_CONDITION 
FROM 
  (
    SELECT 
      NULL AS TABLE_CAT, 
      n.nspname AS TABLE_SCHEM, 
      ct.relname AS TABLE_NAME, 
      NOT i.indisunique AS NON_UNIQUE, 
      NULL AS INDEX_QUALIFIER, 
      ci.relname AS INDEX_NAME, 
      CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, 
      (
        information_schema._pg_expandarray(i.indkey)
      ).n AS ORDINAL_POSITION, 
      ci.reltuples AS CARDINALITY, 
      ci.relpages AS PAGES, 
      pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION, 
      ci.oid AS CI_OID, 
      i.indoption AS I_INDOPTION, 
      am.amname AS AM_NAME 
    FROM 
      pg_catalog.pg_class ct 
      JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) 
      JOIN pg_catalog.pg_index i ON (ct.oid = i.indrelid) 
      JOIN pg_catalog.pg_class ci ON (ci.oid = i.indexrelid) 
      JOIN pg_catalog.pg_am am ON (ci.relam = am.oid) 
    WHERE 
      true 
      AND ct.relname = 'mytable'
  ) AS tmp 
ORDER BY 
  NON_UNIQUE, 
  TYPE, 
  INDEX_NAME, 
  ORDINAL_POSITION

SQL Server gives you the row count for the table, and tells you about columns with a UNIQUE constraint, but doesn't give you much else. Internally it runs EXEC sp_statistics_100 ?,?,?,?,?, ?

It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to get info about indices, not really get us detailed statistics), so if we want detailed statistics we'll have to do it per database.

Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755

Intends to tackle: - #621 - #685 - #736 - #755

- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - Get/SetOption - error_details (apache#755) - GetStatistics (apache#685)

- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - Get/SetOption - error_details (apache#755) - GetStatistics (apache#685) - New ingest modes (apache#541)

- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - error_details (apache#755) - GetStatistics (apache#685)

- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (apache#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (apache#319) - Get/SetOption - error_details (apache#755) - GetStatistics (apache#685) - New ingest modes (apache#541)

- ADBC_INFO_DRIVER_ADBC_VERSION - StatementExecuteSchema (#318) - ADBC_CONNECTION_OPTION_CURRENT_{CATALOG, DB_SCHEMA} (#319) - Get/SetOption - error_details (#755) - GetStatistics (#685) - New ingest modes (#541)

Intends to tackle: - #621 - #685 - #736 - #755