table statistics and selectivity for query cost #927

Open
jchappelow opened this issue Aug 22, 2024 · 1 comment

jchappelow (Member) commented Aug 22, 2024

Related to #410, this issue describes the main tasks related to table statistics.

Purpose

The goal of statistics in a DB engine's query planner is to find an optimal (i.e. faster) plan by comparing the costs of the candidate plans. In Kwil, the goal is to price the execution of a transaction (or part of a transaction) fairly and defensively. Thus, we balance competing objectives:

  1. Prevent transactions that are costly to execute in terms of node resources (CPU/RAM/disk/time) but are cheap in terms of gas for the user. This is a defensive consideration.
  2. Fairly price transaction execution such that users can actually afford to query large datasets without paying exorbitant costs.

Statistics and Selectivity

Statistics on the data in user datasets are necessary for obtaining selectivity estimates for conditional clauses (e.g. WHERE x < $1). They provide the cost estimation code with a more accurate count of the rows that pass through any logical plan that has the conditional as an input in the plan tree. Considered together with the expressions and data types, this should reflect the cost of execution. For more information, see https://www.postgresql.org/docs/current/row-estimation-examples.html

Selectivity Overview

The general idea of selectivity is rows = rel_cardinality * selectivity, where rel_cardinality is the number of rows in the table and rows is the number we wish to know: how many rows would be selected by some condition whose selectivity is a number on [0,1]. Selectivity is thus based on some knowledge of the distribution of values in a column, together with a provided value (e.g. $1 in the < comparison above).

NOTE: if we only wanted to achieve the first goal in the Purpose section (defensiveness), only the table's total row count (rel_cardinality) would need to be considered. Providing a fair cost given a query that processes or returns far fewer rows requires estimating selectivity and maintaining statistics to do that.
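To make the formula concrete, a tiny worked example in Go (the numbers are made up purely for illustration):

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers for illustration only.
	relCardinality := 10000.0 // total rows in the table
	selectivity := 0.015      // estimated fraction satisfying e.g. WHERE x < $1
	rows := relCardinality * selectivity
	fmt.Printf("estimated rows: %.0f\n", rows) // ≈ 150
}
```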

Statistics

To produce meaningful selectivity estimates, we are defining column statistics with several key pieces of information:

  • row count (actually a table-level statistic)
  • min and max values, and their observation counts
  • most common values (MCVs), which are a pair of arrays: the most common values, and their frequencies. The length of these arrays has a limit, analogous to default_statistics_target in postgres.
  • a histogram, which is populated after the MCV set reaches its capacity
  • (future) distinct value count, which requires special approximation algorithms to avoid the possibility of loading the entire dataset into memory if the column has no repetition

The MCVs and histogram together account for 100% of all rows observed.
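A rough sketch of how these pieces might be grouped in Go. The type and field names here are assumptions for illustration, not kwild's actual definitions:

```go
package stats

// ColumnStatistics groups the per-column statistics described above.
type ColumnStatistics struct {
	MinVal, MaxVal     any // extreme observed values
	MinCount, MaxCount int // observation counts of the extremes

	// Most common values and their frequencies, kept in ascending value
	// order and bounded by a capacity (like default_statistics_target).
	MCVals  []any
	MCFreqs []int

	// Histogram of values that spilled once the MCV set was full:
	// uniform-width bins with explicit counts, plus catch-all end bins.
	HistBounds []any
	HistCounts []int

	// Future: approximate distinct-value count.
	// DistinctCount int64
}

// TableStatistics holds the table-level row count alongside per-column stats.
type TableStatistics struct {
	RowCount int64
	Columns  map[string]*ColumnStatistics
}
```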

MCVs

The most common value (MCV) set is:

  • an MCVals []any (or []T with generics) that contains values of the column's type, in ascending order
  • a MCFreqs []int that contains the corresponding occurrence count of each of the values in MCVals

These have a hard capacity limit. In a full table scan, this capacity may be reached well before all values have been observed. I toyed with a multi-pass scan approach, but it was costly and complex. The approach used by PostgreSQL internally is to spill new values into a histogram when the MCV set is at capacity.
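A minimal single-pass observe-and-spill sketch for an int64 column (the type, field, and method names are assumptions; maintaining the ascending order of mcVals is omitted for brevity, and intHistogram is sketched under Histogram below):

```go
package stats

// intColStats is an illustrative MCV container for an int64 column.
type intColStats struct {
	mcvCapacity int
	mcVals      []int64
	mcFreqs     []int
	mcvIndex    map[int64]int // value -> position in mcVals/mcFreqs
	hist        *intHistogram // spill target once the MCV set is full
}

// observe records one value: bump an existing MCV, admit a new MCV while
// there is capacity, otherwise spill into the histogram.
func (cs *intColStats) observe(v int64) {
	if i, ok := cs.mcvIndex[v]; ok {
		cs.mcFreqs[i]++
		return
	}
	if len(cs.mcVals) < cs.mcvCapacity {
		cs.mcvIndex[v] = len(cs.mcVals)
		cs.mcVals = append(cs.mcVals, v)
		cs.mcFreqs = append(cs.mcFreqs, 1)
		return
	}
	cs.hist.add(v)
}
```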

Histogram

A histogram is an array of counts, where each count is the number of times a value within a bin was observed. A bin has boundaries in the units of the column's type. PostgreSQL builds histograms where each bin has the same number of counts (bins have different widths), allowing it to omit the counts array. We are constrained by a single-pass scan and the need to perform updates continually, so we use a traditional histogram that has uniform widths and different counts per bin. Rebalancing of the histogram is costly and only done in a full refresh of the statistics.

NOTE: "the histogram does not include the portion of the column population represented by the MCVs"

Defining a histogram's bounds involves interpolation for every supported type of value. The full range of the histogram is based on the values observed up to that point in the scan (when the MCVs begin to spill). The first and last bins are catch-all bins for out-of-range values.
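A uniform-width histogram with catch-all end bins could look roughly like this for an int64 column (an illustrative sketch under the assumptions above, not kwild's implementation):

```go
package stats

// intHistogram is a uniform-width histogram for an int64 column.
// counts[0] and counts[len(counts)-1] are catch-all bins for values below
// lo and at/above hi, respectively.
type intHistogram struct {
	lo, hi int64 // range observed when the MCVs began to spill
	counts []int // interior bins plus the two catch-all bins
}

func newIntHistogram(lo, hi int64, bins int) *intHistogram {
	return &intHistogram{lo: lo, hi: hi, counts: make([]int, bins+2)}
}

// add places a value into its bin, using the end bins for out-of-range values.
func (h *intHistogram) add(v int64) {
	switch {
	case v < h.lo:
		h.counts[0]++
	case v >= h.hi:
		h.counts[len(h.counts)-1]++
	default:
		interior := len(h.counts) - 2
		width := float64(h.hi-h.lo) / float64(interior)
		bin := int(float64(v-h.lo)/width) + 1 // +1 skips the low catch-all
		h.counts[bin]++
	}
}
```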

Computing Selectivity

It is best to read https://www.postgresql.org/docs/current/row-estimation-examples.html

In short, given a condition like WHERE X < 42, the combined observations in the MCVs and histogram give a reduced row count. Since the MCVals array is sorted, the corresponding MCFreqs are summed up to the location determined by the condition. However, the MCV list may account for, say, 40% of all values in the table, while the histogram accounts for the rest, in this example 60%. The histogram selectivity and fraction are then considered together with the MCV selectivity.

This is a basic example. The condition can be (in)equality, or a combination of conditions such as a range, etc. See the docs page linked above for more on other cases.
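A hedged sketch of that combination for x < v, reusing the illustrative types from above and weighting each source by the fraction of observed rows it covers (partial-bin interpolation, which PostgreSQL performs, is omitted; this is not kwild's actual cost code):

```go
package stats

import "math"

// selectivityLT estimates the fraction of rows satisfying x < v by combining
// the MCV list and the histogram, each weighted by its share of observed rows.
func selectivityLT(mcVals []int64, mcFreqs []int, h *intHistogram, v int64) float64 {
	var mcvTotal, mcvBelow int
	for i, mv := range mcVals {
		mcvTotal += mcFreqs[i]
		if mv < v {
			mcvBelow += mcFreqs[i]
		}
	}

	// Count histogram rows in bins whose upper bound is at or below v.
	var histTotal, histBelow int
	interior := len(h.counts) - 2
	width := float64(h.hi-h.lo) / float64(interior)
	for i, c := range h.counts {
		histTotal += c
		var upper float64
		switch {
		case i == 0: // low catch-all: everything in it is below lo
			upper = float64(h.lo)
		case i == len(h.counts)-1: // high catch-all: unbounded above
			upper = math.Inf(1)
		default:
			upper = float64(h.lo) + float64(i)*width
		}
		if upper <= float64(v) {
			histBelow += c
		}
	}

	total := float64(mcvTotal + histTotal)
	if total == 0 {
		return 0.5 // nothing observed yet; arbitrary default
	}
	mcvFrac := float64(mcvTotal) / total   // e.g. the 40% in the text
	histFrac := float64(histTotal) / total // e.g. the remaining 60%
	var selMCV, selHist float64
	if mcvTotal > 0 {
		selMCV = float64(mcvBelow) / float64(mcvTotal)
	}
	if histTotal > 0 {
		selHist = float64(histBelow) / float64(histTotal)
	}
	return mcvFrac*selMCV + histFrac*selHist
}
```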

Building and Maintaining Statistics

Full Scan

A full scan of a table (i.e. SELECT * FROM ... ORDER BY pk, ...) is a method for building statistics from scratch. These are the "ground truth" statistics for a table. We may consider a postgres extension that uses SPI to do this and avoid the protocol overhead incurred by doing it in the application.
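In the application, the full scan could be as simple as the following sketch (the table and column names, and the observe helper from the MCV sketch above, are placeholders):

```go
package stats

import (
	"context"
	"database/sql"
)

// rebuildColumnStats scans one column of a table in full and rebuilds its
// statistics from scratch. Illustrative only; the real implementation might
// instead live in a postgres extension using SPI to avoid wire-protocol
// overhead.
func rebuildColumnStats(ctx context.Context, db *sql.DB, cs *intColStats) error {
	rows, err := db.QueryContext(ctx, `SELECT x FROM my_table ORDER BY pk`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var x int64
		if err := rows.Scan(&x); err != nil {
			return err
		}
		cs.observe(x) // single-pass update of MCVs / histogram
	}
	return rows.Err()
}
```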

When a new dataset is created, tables are empty and this is not needed. A full scan is applicable for periodic refresh/rebuild of table statistics, and perhaps for recovery. See the next section on Updates for an explanation of why periodic refresh is needed.

Continuous Updates

All changes (insert/update/delete) to a table are captured in the logical replication monitor, and these change sets are used to continuously update a table+column's statistics.
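Folding a replicated change into the statistics could look roughly like this (the change kinds, field names, and the remove helper are hypothetical; the real changeset format comes from the logical replication monitor):

```go
package stats

// applyChange folds one replicated row change for a single int64 column into
// its statistics. remove is a hypothetical counterpart to observe that
// decrements an MCV frequency or a histogram bin.
func applyChange(cs *intColStats, kind string, oldVal, newVal int64) {
	switch kind {
	case "insert":
		cs.observe(newVal)
	case "delete":
		cs.remove(oldVal)
	case "update":
		cs.remove(oldVal)
		cs.observe(newVal)
	}
}
```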

Over time, the statistics can become poorer representations of the table than if a full rescan were performed. Some simple examples:

  • deleting (or updating) values will update the MCV set, but the remaining entries may no longer be the most common; other values that now belong in the MCVs alongside (or instead of) the deleted ones are tracked only in the histogram
  • if the MCV set is full, many inserts of some new value will spill into the histogram (as designed), but over time it would be more informative for that value to have a discrete entry in the MCVs

Continuous updates are relatively fast and cheap, but may lead to stale statistics.

Periodic Refresh

To address the above issues with continuous updates leading to stale statistics, datasets are scheduled for periodic refresh of the statistics.

The scheduling is presently based on: (1) current height, (2) the dataset's dbid, and (3) a constant modulo value e.g. 20. This determines if a dataset should have its table statistics rebuilt at a given height. This staggers the rescans, rather than rebuilding all datasets' statistics at the same time. It is also deterministic.
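A hedged sketch of that deterministic, staggered schedule (the hash and the interval constant are illustrative choices, not necessarily kwild's):

```go
package stats

import "hash/fnv"

// refreshInterval staggers full statistics rebuilds across datasets.
const refreshInterval = 20

// shouldRefresh reports whether the dataset identified by dbid should have
// its table statistics rebuilt at this block height. It depends only on the
// height and the dbid, so every node reaches the same answer.
func shouldRefresh(height uint64, dbid string) bool {
	h := fnv.New64a()
	h.Write([]byte(dbid))
	offset := h.Sum64() % refreshInterval
	return height%refreshInterval == offset
}
```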

NOTE: PostgreSQL does periodic statistics rebuilds as part of its "autovacuum" process. AFAIK, it does no continuous updates. The frequency of autovacuum is affected by many factors, and is not predictable between deployments.

Persistence of Statistics

Since the statistics are not rebuilt at every height, and because continuous updates can lead to a drift in the statistics from the ground truth, a node restart cannot simply begin by rebuilding statistics with a full scan. We must store each table's current statistics so that a node will start with exactly the same statistics it had on shutdown (or crash). All nodes will have the same view of statistics (and thus query cost) in this event.

Statistics are polymorphic. This makes it very cumbersome to store the stats in the relational DB. In PostgreSQL, there is a concept of an anyarray, which is used e.g. in the MCVs array, but to applications it is only usable as a function argument, not a column type. I considered several possible table-based solutions, such as creating a stats schema with a table per column like ds_<dbid>_stats.rel_col (min T, max T, mc_vals T[], mc_freqs int[], ...). Ultimately I could not discern any benefit to that effort, only complexity and possible interference with the DB's intended purpose.

Implemented. Given the above, I opted for a very simple and effective approach: implementing encoding.BinaryMarshaler and encoding.BinaryUnmarshaler for all data types used to represent column statistics, and then using the encoding/gob package to create a blob containing the statistics that is only understood by (or useful to) the kwild application. This can be stored as a bytea in a special statistics table with the fully-qualified table name as the primary key. For now, I'm writing this blob to a file on disk. It is stored at the end of every block. The main downside is that its storage is not atomic with the consensus DB commits, so an ill-timed crash could require a state rebuild.
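A minimal sketch of the gob-blob approach, reusing the illustrative TableStatistics type from above (the file path is a placeholder, and any concrete types stored in any-typed fields must implement the binary marshaling interfaces or be registered with gob):

```go
package stats

import (
	"bytes"
	"encoding/gob"
	"os"
)

// saveStats gob-encodes the in-memory statistics, keyed by fully-qualified
// table name, and writes them to disk at the end of a block.
func saveStats(path string, stats map[string]*TableStatistics) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(stats); err != nil {
		return err
	}
	return os.WriteFile(path, buf.Bytes(), 0o644)
}

// loadStats restores the statistics written by saveStats so a restarted node
// resumes with exactly the statistics it had at shutdown.
func loadStats(path string) (map[string]*TableStatistics, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var stats map[string]*TableStatistics
	if err := gob.NewDecoder(bytes.NewReader(b)).Decode(&stats); err != nil {
		return nil, err
	}
	return stats, nil
}
```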

Updates within DB transaction (between statements)

This needs investigation. I have a list of ideas including audit tables. I'm working on all the above first.

Computing cost for statements within procedures

The query planner, access to statistics, and the cost model that combines both need to be usable within a procedure.

As a starting point, we can compute the cost for all of a procedure's SQL statements prior to executing the procedure. Ultimately, if we want to account for cascading effects from one statement to the next, we need a better solution.

brennanjl (Collaborator) commented:

Ok, this is a lot to digest, but I'll do my best to provide feedback and thoughts. Overall, this sounds like what we need, and I can 100% see how we could compute selectivity using a logical query plan.

Most of my concerns are regarding refreshing and persisting statistics deterministically, as well as the ability for intra-procedure updates.

Refreshing + Persistence

I totally understand why we can't have a full statistics refresh after every block; however, I am unsure how we can ensure deterministic refresh with snapshots. Say we have a network with two nodes that prunes every 100 blocks. One node has been a validator for the entirety of the network, while the other joins at snapshot 100 and only begins replaying transactions at block 101.

Say a schema has a statistics refresh at block 98, and then has some data written in block 99. How will the new node (which is starting at snapshot height 100) get the same statistics? I would normally assume that the persisted statistics are part of consensus (and thus replicated within snapshots); however, is this possible given that they are not atomically committed with the rest of the block?

Total conjecture, but is there a way to handle this via the Postgres SPI, such that statistics are computed prior to being committed? Even if possible, this would obviously be a non-trivial departure from what you have now, but I'm just spit-balling.

Intra-Procedure Statistics

Unfortunately, I think it is probably inevitable that we will need both intra-block and intra-procedure statistics, since there is an obvious attack vector if we don't. We don't need them right now (since this is still a major improvement over the non-existent protection we currently have), but I do think it is inevitable. No further comments other than this, but just getting something working with fixed statistics for the lifetime of a block is adequate right now.
