Skip to content
This repository has been archived by the owner on May 22, 2024. It is now read-only.

v12 merge commit message

Heikki Linnakangas edited this page Sep 22, 2020 · 9 revisions

Merge with PostgreSQL version 12 (up to a point between beta2 and beta3).

This is the point where PostgreSQL REL_12_STABLE was branched off the master branch and v13 development started.

See PostgreSQL release notes for v10, v11 and v12 for information on the uptream changes included in this merge:

https://www.postgresql.org/docs/release/10.0/ https://www.postgresql.org/docs/release/11.0/ https://www.postgresql.org/docs/release/12.0/

The two most notable upstream features that had a big impact on GPDB code are:

  1. Partitioning
  2. Table AM API.

The old GPDB partitioning support was completely ripped out and replaced with the upstream code, and new glue code was written to keep the old syntax working on top of the new implementation. Similarly, the AO and AOCO table code was refactored to be table access methods, working under the new Table AM API.

In addition to those big-ticket items, there are many, many smaller changes, detailed in the sections below. This isn't a comprehensive list of all upstream changes, I have only noted items that had a special impact on GPDB, because they work somehow differently from upstream, or they affected existing GPDB code or tests somehow. Also, there are many little things marked with GPDB_12_MERGE_FIXME comments in the code that will need to be addressed after the merge.

Partitioning

All the GPDB-specific code to deal with partitioning was ripped out and replaced with the new PostgreSQL partitioning code. Compatibility for the old GPDB partitioning syntax was reimplemented on top of the upstream code, in the parse analysis phase.

Partitioning features that were removed or substantially changed

  • Multi-column list partitioning is no longer supported. As a work-around, you can create a composite type and use that as the partitioning key, like this:

    create type foo_partkey as (b int, c int); create table foo (a int, b int, c int) partition by list ((row(b, c)::foo_partkey));

    Partition pruning might not work in all the cases that it used to with that work-around though.

  • PostgreSQL range partitions don't support START EXCLUSIVE or END INCLUSIVE. The start boundary is always inclusive and end exclusive. To support the old syntax, we convert EXCLUSIVE to INCLUSIVE by doing "+ 1" on the boundary value. That only works for datatypes that have a suitable "+" operator, like 'integer' and 'timestamp', but not 'float' or 'text'.

  • Addressing a partition by RANK, in "ALTER TABLE ALTER/DROP PARTITION FOR (RANK (1))", is no longer supported.

  • The gp_enable_exchange_default_partition GUC was removed. With upstream partition code exchanging (essentially attaching new) default partition comes with no risk of wrong data. Attaching a default partition code validates that the data does not violate the partition constraints, so blocking the exchange of default partition under GUC is no longer required. It's same as exchanging any other partition.

  • The old GPDB implementation had a concept of "partition name" that was separate from "relation name". The new upstream implementation does not. To keep old ALTER TABLE DROP PARTITION etc. commands that used partition name working as much as possible, they now search for a partition with a rel name that's constructed using the same rules that the legacy CREATE TABLE syntax comaptibility layer uses. So it mostly works, assuming you consistently use the legacy syntax. It is recommended to convert new applications to the new upstream syntax, however, to avoid ambiguity. The rules for forming relation names for partitions using the legacy CREATE TABLE syntax are mostly the same as in earlier GPDB versions, but not necessarily exactly the same in all cases.

  • The old partitioning-related catalog tables, pg_partition and pg_partition_rule are gone, as well as the pg_partition_def() function. Partitioning information is now stored in the upstream pg_partitioned_table catalog, and in pg_class in relispartition and relpartbound fields. PostgreSQL also has three helper functions: pg_partition_ancestors(rel)), pg_partition_root(rel) and pg_partition_tree(rel)

  • Partition boundaries are no longer represented as CHECK constraints. Partition constraints is a completely separate concept now.

  • Partitioned tables have no access method. They used to be heap or AO or AOCO, even though they contained no data, but not anymore.

  • A "partition template" is created if you use the legacy CREATE TABLE with SUBPARTITION TEMPLATE syntax. The partition template is used if you add a partition with ALTER TABLE ADD PARTITION, but not with the upstream CREATE TABLE PARTITION OF syntax. The templates are stored in the new gp_partition_template catalog (previously, in pg_partition_rule). The representation stored in catalog is the "raw" parser output, before resolving type names etc. Partition templates are not currently dumped by pg_dump, so they will be lost on dump & restore or pg_upgrade.

Other notable partioning changes

  • pg_dump always uses the upstream syntax now. That fixes a lot of the issues we've had with dump & restore of complex partitioning cases. See github issues #2647, #2648, #3598, #5003, #6455. (I haven't verified that all of those issues are really fixed by this, but they should be).

  • In PostgreSQL, GRANT/REVOKE on a partitioned table only affects the partitioned table itself, not the partitions. Permission on the partitioned table is enough to query the partitioned table and all its children, but it makes a difference if you query the partition directly, or if the partition is detached. We kept the old GPDB behavior where GRANT/REVOKE recurses on the child partitions. (There is no ONLY syntax on GRANT/REVOKE that could be used to control it.)

  • Similarly, when you create a new partition with CREATE TABLE PARTITION OF, the permissions of the new parent table are copied to the new partition.

  • Updating the partitioning key is now allowed, and the row is moved to the correct partition. Used to throw an error.

  • Unlogged tables are now allowed with regular tables in partition hierarchy. Temporary and permanent tables still cannot be mixed.

  • Inserting into an intermediate partition is now allowed. The row will be routed to the correct partition, like in PostgreSQL. It used to be forbidden in GPDB.

  • Unlike PostgreSQL, we allow COPY TO on a partitioned table. That worked in GPDB before, so we needed to keep it working. It was re-implemented by automatically transforming the "COPY

    TO ..." command to "COPY (SELECT * FROM
    ) TO ...". The hint in the upstream error message suggests doing that, but we now do it automatically. There is already code to do that transformation, for COPY TO when Row-Level Security is enabled, so this was a one line change.
  • REINDEX TABLE on a partitioned table recurses to the partitions, and reindexes all indexes on all partitions. This is different from upstream, where REINDEX TABLE on a partitioned emits a warning and does nothing. But it is the same behavior as on previous versions of GPDB.

  • REINDEX INDEX on a partitioned index is a no-op, as a partitioned table doesn't contain any data. In PostgreSQL, it throws an error, but we have allowed it historically in GPDB.

  • There are behavioral changes in how triggers are fired with partitioned tables; we now follow the upstream rules. Triggers on a partitioned table are fired when you insert/delete/update it. If you update a row from one partition to another, delete+insert triggers are fired. The 'triggers_gp' regression test captures the different combinations, take a look at the changes to the expected output. (Statement triggers are still not supported in GDPB)

  • PARTITION BY is now accepted on both sides of WITH clause in CREATE TABLE. The old GDPB syntax had it after them, the upstream syntax had it before it, so allow both.

  • Lots of error messages related to partitining are now different, since we rely on upstream code.

  • Removed a lot of NOTICEs when partitions are created or dropped.

  • Partition Selectors

    GPDB-specific Partition Selector nodes are still created, for partition pruning on joins, but the planner code to create them was largely rewritten, to leverage the upstream partition pruning code as much as possible:

    • In executor, Partition Selectors now use the upstream partition pruning infrastructure to compute which partitions can contain tuples that pass the join qual. The partitions are recorded in a Bitmapset, like the upstream partition pruning code does, and the Bitmapset is passed from the PartitionSelector to the Append node using a special executor Param. The old hash table that contained the partition OIDs is gone, as is the PartSelected expression. The Append node now directly skips the pruned partitions

    • In planner, use the upstream code to construct the PartitionPruneInfo steps to do the pruning. Thanks to this, join pruning now also works for quals with <, > operators, and for tables partitioned on expressions rather than plain columns.

    • In planner, refactor where the decision to do join pruning is made. It's now decided in create_append_plan(), where run-time pruning is decided, too. To pass the partition pruning info to the Partition Selector above the Append, maintain a stack of joins that are candidates for Partition Selectors as we recurse in create_plan().

    ORCA only supports very simple static partition pruning currently, and falls back for most cases involving partitioned tables. FIXME comments are added to the code and/or tests where that happens now, and work is scheduled to re-integrate more partitioning support later.

    Append-optimized tables and the new Table AM API

    PostgreSQL v12 introduced a new API for table access methods that allows replacing the usual heap storage with something else. The AO and AOCO table code was refactored to sit behind the new API. There are now two extra table AMs in GPDB in addition to the 'heap': 'ao_row' and 'ao_column'. The old syntax, "WITH (appendonly=true)", is now mapped to "CREATE TABLE ... USING ao_row", and similarly for column-orinted AOCO tables.

    The 'appendonly' and 'orientation' options are no longer accepted in the 'gp_default_storage_options' GUC. Its interactions with 'default_table_access_method' and options given in the DDL command became too unwieldy, so it was deemed better to stop supporting it. You can use the upstream 'default_table_access_method' GUC instead.

    There are some other small differences in what table options are accepted in what contexts, where the old behavior was too difficult to maintain exactly. For example, if you specify 'compresslevel' and 'compresstype' in the CREATE TABLE WITH clause, but some of the partitions use heap storage, you now get an error. It used to be accepted, but the options only affected the AO partitions.

    Table inheritance is now allowed for AOCO tables. It was previously disallowed for no discernible reason, and it seems to just work now.

    The table AM API doesn't cover all the places where we had had to modify upstream code previously. There are still some RelationIsAppendOptimized() calls sprinkled in the backend code, but much fewer than before. One big difference is that updates to AO tables need a call to appendonly_dml_init() or aoco_dml_init() to be called before modifying the table, and appendonly_dml_finish() or aoco_dml_finish() afterwards.

    The access method is now represented in the catalogs like in upstream, in the pg_class.relam field. The old GPDB-specific pg_class.relstorage column has been removed.

    Planner changes

    • Big changes to partition planning and pruning, as we merged upstream partititioning code. Most GPDB partitioning code has been replaced with corresponding upstream. See section "Partitioning" for details.

    • PostgreSQL v12 planner learned to inline side-effect free CTEs. That is, if you do "WITH foo (SELECT ...) SELECT * FROM foo, ...", the 'foo' subquery is inlined into the main query and planned as one. Previously, the CTE was always planned separately, and materialized in full. GPDB has always supported inlining CTEs, so this isn't new to GPDB, but we adopted the code changes related to that as much as possible. In PostgreSQL, the feature can be controlled by specifying "WITH foo AS MATERIALIZED (...)" or NOT MATERIALIZED, and that now works in GPDB too.

      In GPDB, the inlining could previously be controlled with the "gp_cte_sharing" GUC. It still works, but if MATERIALIZED or NOT MATERIALIZED was specified in the query it takes precedence. Like in previous releases, Even if gp_cte_sharing is set to 'on', the subquery is inlined if there is only one reference to it in the query.

      When gp_cte_sharing = 'off' (the default), GPDB inlines the subqueries even if it's not strictly legal to do so. See https://github.com/greenplum-db/gpdb/issues/1349. This merge doesn't change that.

    • Partitionwise joins and aggregates were introduced in PostgreSQL v11. They work in GPDB too. The main benefit of partitionwise aggregation is that if the partition hierarchy contains a foreign table, the aggregation can be pushed down to the foreign server.

    • GROUPING SETS can now be implemented by hashing, or a mix of hashing and sorting. It shows up as a MixedAggregate node in EXPLAIN. However, the GPDB-specific multi-stage aggregate planning code (in cdbgroupingpaths.c) has not been updated to take advantage of that, so you only get such plans with one-stage aggregates, when the grouping keys happen to coincide with the table's distribution keys. This is a TODO.

    Other changes

    • The default logging directory, 'pg_log', was renamed to just 'log' in PostgreSQL v10. That might affect tools that would try to look into the logs. (The 'pg_xlog' and 'pg_clog' directories were also renamed to 'pg_wal' and 'pg_xact', respectively, but external tools shouldn't be peeking into those directories directly)

    • JIT compilation, introduced in PostgreSQL v11, works. We haven't done anything extra for JIT compilation of GPDB-specific code, but everything that evaluates expressions benefits from the upstream JIT support, also in GPDB-specific node types.

    • Hash indexes are enabled now. We had disabled them completely in GPDB, because they were not WAL-logged. Since PostgreSQL v10 they are, so we can enable them.

    • "MK sort", in tuplesort_mk.c, is gone. The "gp_enable_mk_sort" GUC is gone too. We always use the upstream sort code now. There have been performance improvements to it in the upstream releases, but this is still a performance regression in cases where the MK-sort was particularly good. That is, if you have multiple ORDER BY columns, with lots of duplicates. We might still reintroduce MK sort later in some form, as a separate PR. But that work should be made on pgsql-hackers for PostgreSQL first.

    • The GPDB-specific "Hybrid Hash Agg" code is gone. Its purpose was to support spilling large hash aggregates to disk, instead of running out of memory. That functionality has been replaced by backporting the functionally similar hash agg spilling feature from PostgreSQL v13. That consisted of a number of upstream commits:

      a10701c0e3 Allow simplehash to use already-calculated hash values. 81080df32b Add transparent block-level memory accounting 87b6b57191 Fix edge case leading to agg transitions skipping ExecAggTransReparent() calls. f76e8f8507 Refactor hash_agg_entry_size(). 4ac4bd07fa Introduce TupleHashTableHash() and LookupTupleHashEntryHash(). 5dddc85d78 Change signature of TupleHashTableHash(). d238458f6d Minor refactor of nodeAgg.c. a9e9e9bc8e Fixup for nodeAgg.c refactor. b27eec5ea6 Extend ExecBuildAggTrans() to support a NULL pointer check. 0a390a8e8b Introduce LogicalTapeSetExtend(). 5d9fd3ae3d Disk-based Hash Aggregation. 5471822147 Fixes for Disk-based Hash Aggregation. 8b6151eae7 Avoid allocating unnecessary zero-sized array. 04fced3688 Fix costing for disk-based hash aggregation. e4ee460db3 Include chunk overhead in hash table entry size estimate. 255871265e Create memory context for HashAgg with a reasonable maxBlockSize. 02f3d5670a Fix HashAgg regression from choosing too many initial buckets. f12207ca4e Logical Tape Set: use min heap for freelist.

      As a result, the nodeAgg.c code is more similar to the v13 version than v12. TODO: backpatch any followup commits from REL_13_STABLE before release.

    • Parallelism is still disabled in GPDB. That includes the new parallel CREATE INDEX code.

    • Fix behavior with UPDATE WHERE CURRENT OF. As you can see in the changes to expected output of the 'portals_updatable' test, some of the test results changed. As far as I can see, the old behaviour memorized in the expected output was incorrect. PostgreSQL has always returned the result we're now seeing on the merge branch, which is different from the result you get on master. The test had not been modified since 2012, when the test was added along with support for WHERE CURRENT OF. I could not find any explanation in git history or in test comments. I believe it was simply a bug when WHERE CURRENT OF support was added, and no one noticed that the test result was wrong.

    • The GPDB implementation of "RANGE offset PRECEDING/FOLLOWING" in window functions was replaced with the upstream implementation that was introduced in version 11. The upstream implementation is more strict about the datatypes than the code in GPDB master. However, looking at old JIRA ticket MPP-5246 that was mentioned in one of the tests, it had a test case like this attached to it:

      explain select cn, count(*) over (order by dt range between '2 day'::interval preceding and 2 preceding) from mpp5246_sale; ERROR: can't coerce leading frame bound to type of trailing frame bound HINT: specify the leading and trailing frame bounds as the same type

      So apparently we were strict about the types back in 2009, and you got that error up to 5X_STABLE, but it got relaxed in 6X_STABLE. And now it becomes more strict with the upstream code again. I think that's fine, although we probably shouldn't have relaxed it in 6X_STABLE. Too late to change that.

    • Remove the undocumented 'gp_indexcheck_vacuum' GUC. A new contrib module, contrib/amcheck, was added in PostgreSQL v10, with similar functionality.

    • Logical replication is still disabled in GPDB.

    • The default of 'hot_standby' was changed to 'on' in upstream, but hot standby is still not supported GPDB, and the default in GPDB remains 'off'. It is now possible to turn it 'on' manually in mirrors, though. That allows tools like pg_basebackup to work, and it allows running the upstream replication tests in src/test/recovery. However, there's no user-visible, supported, new functionality here.

    • pg_basebackup got a feature to verify checksums, but that has been disabled in the wrapper python scripts. So checksums are not verified by the management utilities that use pg_basebackup under the hood (gpexpand, gpinitstandby, gpconfigurenewsegment). Likewise, tests that use pg_basebackup were modified with the --no-verify-checkums flag. The reason for this is that the checksum verification is broken for AO/AOCO tables, which don't have page checksum like other relations do, and pg_basebackup cannot distinguish them. We ought to fix this somehow, but for now it's a FIXME.

    • pg_ctl -w now waits for distributed transaction recovery.

      When starting master in distributed mode, in GPDB pg_ctl -w should wait for distributed transaction recovery to complete. As only after that connections are allowed.

      Adding new state "dtmready" to be recorded in PID file to mark distributed transaction recovery completion. When dtx background worker finishes distributed transaction recovery, postmaster writes this state to file. pg_ctl waits for this "dtmready" state instead of "ready" state is starting master in distributed mode which is conveyed using "-E" postmaster options.

      This change should meet the previous expectations, that after successful completion of gpstart, distributed transaction can be made.

    • If a client tries to connect while DTM recovery in-progress, you now get an error. Client connections used to wait for dtx recovery process to complete distributed transaction recovery. Depending on situation this waiting could be infinite. To avoid hung connections though in certain situation where dtx recovery can't complete the distributed transactions, better to error out and let client retry in this situation similar to local crash recovery situation. (Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/dgU9il8l24A/NTAz2cl5BQAJ)

    Source code and test changes

    • PostgreSQL v11 replaced the way the initial catalog contents are stored in the source code. Instead of the "DATA(insert ..." lines in the .h files, the initial contents are stored in JSON format, in separate .dat files. The JSON format is much more convenient to edit, and makes it more convenient to carry diffs against them, so we don't need the GPDB-specific perl scripts to provide defaults for GPDB-added catalog columns anymore, nor the "catullus.pl" script that was used to generate pg_proc.h contents for GPDB-added functions. That's all gone, and all GPDB-specific changes are made to the .dat files directly.

    • PostgreSQL v12 removed support for OIDs in user tables, and changed all catalog tables with OIDs so that the OID is a regular column. A lot of GPDB specific code and tests to deal with OIDs was removed with that.

    • A new Kerberos test suite was added to PostgreSQL v11. We had one in GPDB already, under src/test/gpkerberos. It's not clear how much overlap there is between the two, we could possibly get rid of the GPDB test suite now, but for now, keep them both.

    • Support for passing around MemTuples in TupleTableSlots was removed. MemTuple is still the native format of AO (row) tables, but is deformed and stored in a virtual tuple slot as soon as it's read from disk. MemTuple is no longer used as the on-wire format in Motions either, nor as the on-disk format for temp files when sorting; we now use upstream MinimalTuple for those.

    • HeapTuple->t_tableOid has been resurrected. We had removed it in previous GPDB versions for performance reasons, but I don't believe that the overhead is really noticeable, at least not in recent PostgreSQL versions.

    • The pg_rewind regression tests were previously implemented in GPDB as a shell script. It was done that way when we backported the upstream tests but didn't have the TAP test framework yet. Now we do, so replace the shell implementation with the upstream TAP tests.

    • The way ALTER TABLE commands are dispatched was refactored. Some ALTER TABLE subcommands need to dispatch some extra information from QD to QEs. Previously, the ATExec* functions modified the sub-commands in the QD, and the modified subcommands were dispatched to the QEs. The problem with this approach is that the QEs try to run the ATPrep* phase on the already- prepped commands, so we had to hack the ATPrep* functions to cope with that. That was causing merge conflicts, which could be fixed, but I was not happy with the diff footprint of that approach. Instead, dispatch the AlteredTableInfo structs that were constructed in the QD's prep phase. That way, the execute phase can modifiy the sub-commands and those modifications are dispatched.

    • Upstream isolation tests (src/test/isolation) are now run in utility mode. Instead of disabling or heavily modifying isolation upstream test better to run them in utility mode (kind of single node postgres) and get the code coverage. Still doesn't test GPDB clustered behavior. Need to enhance the framework for it to understand locking across segments and also work with global deadlock detector.

    • SRFs in target lists. We adoped the new upstream node type, ProjectSet, to deal with SRFs in target lists. As a performance optimization, there were some GPDB changes in the planner earlier, to always use a Result node if there were SRFs in the target list, and the SRF-in-targetlist support was removed from other executor nodes. That was pretty much the same changes that were made in upstream, except that upstream uses ProjectSet instead of Result, and the behavior was changed to be more sane when you had multiple SRFs. Replace all the GPDB changes with upstream. Due to the behavior changes with multiple SRFs, some INSERT statements in GPDB tests had to modified so that they generate the same test data as before.

    Co-authored-by: Abhijit Subramanya [email protected] Co-authored-by: Adam Lee [email protected] Co-authored-by: Alexandra Wang [email protected] Co-authored-by: Andrey Borodin [email protected] Co-authored-by: Ashwin Agrawal [email protected] Co-authored-by: Asim Praveen [email protected] Co-authored-by: Chris Hajas [email protected] Co-authored-by: Daniel Gustafsson [email protected] Co-authored-by: David Kimura [email protected] Co-authored-by: Georgios Kokolatos [email protected] Co-authored-by: Heikki Linnakangas [email protected] Co-authored-by: Hubert Zhang [email protected] Co-authored-by: Jesse Zhang [email protected] Co-authored-by: Ning Yu [email protected] Co-authored-by: Pengzhou Tang [email protected] Co-authored-by: Soumyadeep Chakraborty [email protected] Co-authored-by: Weinan Wang [email protected] Co-authored-by: Zhenghua Lyu [email protected]