- Run PostgreSQL and pgAdmin
- Create sample schema with data
- Query plan
- Seq Scan
- Index selectivity
- B-Tree indexes
- Multicolumn indexes
- Unique indexes
- Partial indexes
- Expression indexes
- GIN indexes
- Hash indexes
- Create indexes on foreign keys
- Use more joins
- Don't over-index
- Keep statistics updated
- Detect slow queries
- Cluster a table according to an index
- Use table partitioning
This example covers the most important topics related to PostgreSQL performance.
-
Make sure Docker and Docker Compose are installed and up to date
-
Run
docker-compose up -d
, wait for it to initialize completely -
Login to pgAdmin 4 at
http://localhost:8080
as[email protected]/s3cr3t
-
Connect to
Local PostgreqSQL
asadmin/s3cr3t
-
Create schema by executing DDL statements from
tables.sql
-
Create test data by executing DML statements from
data.sql
-
Generate more test data
INSERT INTO book (isbn, title, publication_date, rating) SELECT SUBSTR(MD5(RANDOM()::TEXT), 0, 14), MD5(RANDOM()::TEXT), DATE '2010-01-01' + CAST(RANDOM() * (DATE '2020-01-01' - DATE '2010-01-01') AS INT), ROUND((1 + RANDOM() * 4)::numeric, 3) FROM generate_series(1, 100000);
PostgreSQL has planner/optimizer that creates an optimal execution plan.
To understand how to optimize an SQL query we need to know its execution plan.
EXPLAIN ANALYZE
followed by SELECT ...
, UPDATE ...
, or DELETE ...
,
executes the statement and provides a query plan with details about the execution.
EXPLAIN ANALYZE
SELECT * FROM book;
QUERY PLAN
----------
"Seq Scan on book (cost=0.00..3667.07 rows=100007 width=89) (actual time=0.003..6.143 rows=100007 loops=1)"
"Planning Time: 0.100 ms"
"Execution Time: 8.622 ms"
A query plan shows what type of scanning was used for the query:
- Seq Scan
- Bitmap Index Scan and Bitmap Heap Scan
- Index Scan
- Index Only Scan
To display a query plan as a diagram with an additional information in pgAdmin 4 use EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
.
- https://www.postgresql.org/docs/13/planner-optimizer.html
- https://www.postgresql.org/docs/13/sql-explain.html
Seq Scan is a full table scan and always reads everything in the table. It scans through every page of data sequentially.
-
Seq Scan is efficient when a large proportion of the table is retrieved from the table
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..3667.07 rows=100007 width=57) (actual time=0.008..13.369 rows=100007 loops=1)" "Planning Time: 0.035 ms" "Execution Time: 15.806 ms"
-
Seq Scan can filter rows while reading them
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date = '1994-11-10';
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..3917.09 rows=27 width=57) (actual time=0.008..8.575 rows=1 loops=1)" " Filter: (publication_date = '1994-11-10'::date)" " Rows Removed by Filter: 100006" "Planning Time: 0.056 ms" "Execution Time: 8.592 ms"
-
When a small number of rows is returned by the query a scan that uses index is more efficient. If there in no suitable index for the query, Seq Scan is the only available option.
Create indexes if often less than 15% of the table's rows are going to be retrieved.
-
Reset statistics counters to zeros
SELECT pg_stat_reset();
-
Let the system work for some time after resetting statistics.
-
Get the suggestion what tables need an index by looking at
seq_scan
andseq_tup_read
. These columns show what tables were used in sequential scansSELECT schemaname, relname as table_name, seq_scan, -- Number of sequential scans initiated on this table seq_tup_read, -- Number of live rows fetched by sequential scans idx_scan, -- Number of index scans initiated on this table idx_tup_fetch -- Number of live rows fetched by index scans FROM pg_stat_user_tables WHERE seq_scan > 0 ORDER BY seq_tup_read DESC;
The number of distinct values in the indexed column divided by the number of records in the table is called a selectivity of an index.
Selectivity is one of factors influencing type of scanning the planner/optimezer will use for a query (Index Scan or Bitmap Index Scan).
- Calculate the selectivity of the single column indexes you want to create
SELECT ROUND(COUNT(DISTINCT rating)::NUMERIC / COUNT(*), 2) AS selectivity FROM BOOK;
selectivity ----------- 0.04
SELECT ROUND(COUNT(DISTINCT publication_date)::NUMERIC / count(*), 2) AS selectivity FROM BOOK;
selectivity ----------- 0.04
- The best selectivity is 1. Only unique indexes on NOT NULL columns guaranteed to have such selectivity.
SELECT ROUND(COUNT(DISTINCT isbn)::NUMERIC / COUNT(*), 2) AS idx_selectivity FROM BOOK;
selectivity ----------- 1
Prefer indexing columns with selectivity greater than > 0.85.
Index Scan uses an index to find rows matching a predicate. It finds each row in the index and then reads the actual data from the table.
-
Create a single column B-Tree index
CREATE INDEX idx_book_title ON book (title);
-
Execute the query
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.title = 'Patterns of Enterprise Application Architecture';
QUERY PLAN ---------- "Index Scan using idx_book_title on book b (cost=0.42..8.44 rows=1 width=57) (actual time=0.195..0.197 rows=1 loops=1)" " Index Cond: ((title)::text = 'Patterns of Enterprise Application Architecture'::text)" "Planning Time: 0.340 ms" "Execution Time: 0.209 ms"
-
Index Scan was used because index
idx_book_title
has good selectivity (1). -
If there is an additional predicate in the query (unindexed columns), the Index Scan can filter rows while reading them, just like a sequential scan
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.title = 'Patterns of Enterprise Application Architecture' AND b.rating > 4.5; --unindexed column
QUERY PLAN ---------- "Index Scan using idx_book_title on book b (cost=0.42..8.44 rows=1 width=57) (actual time=0.013..0.013 rows=0 loops=1)" " Index Cond: ((title)::text = 'Patterns of Enterprise Application Architecture'::text)" " Filter: (rating > 4.5)" " Rows Removed by Filter: 1" "Planning Time: 0.303 ms" "Execution Time: 0.026 ms"
-
Postgres will switch to a Seq Scan instead Index Scan when a large proportion of the table (approximately more than 15% of the table's rows) is retrieved from the table (not selective predicate).
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2012-01-01';
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..3917.09 rows=80034 width=57) (actual time=0.012..16.944 rows=79900 loops=1)" " Filter: (publication_date > '2012-01-01'::date)" " Rows Removed by Filter: 20107" "Planning Time: 0.204 ms" "Execution Time: 19.364 ms"
If index selectivity is bad (approximately less than < 0.85), the planner/optimizer will use Bitmap Scan instead of Index Scan.
Bitmap Scan is in the middle between a Seq Scan and an Index Scan. It is useful when you need a lot of rows from a table and these rows are located in different pages (blocks).
Bitmap Scans always consists of minimum 2 nodes. There is Bitmap Index Scan at the bottom and then Bitmap Heap Scan.
-
Create index on acolumns with bad selectivity
CREATE INDEX idx_book_pub_date ON book (publication_date);
-
Execute the query
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01';
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=116.61..2906.71 rows=9848 width=57) (actual time=1.485..5.062 rows=9869 loops=1)" " Recheck Cond: (publication_date > '2019-01-01'::date)" " Heap Blocks: exact=1334" " -> Bitmap Index Scan on idx_book_pub_date (cost=0.00..114.15 rows=9848 width=0) (actual time=1.324..1.326 rows=9869 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.409 ms" "Execution Time: 5.430 ms"
-
Bitmap Scan was used because index
idx_book_pub_date
has bad selectivity (0.04) and thus the query returns a lot of rows. -
Bitmap scans are capable of combining multiple indexes using bitmap AND (&) and OR (|).
CREATE INDEX idx_book_rating ON book (rating);
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01' AND b.rating > 4.9;
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=166.48..927.55 rows=256 width=57) (actual time=0.537..0.858 rows=253 loops=1)" " Recheck Cond: ((rating > 4.9) AND (publication_date > '2019-01-01'::date))" " Heap Blocks: exact=232" " -> BitmapAnd (cost=166.48..166.48 rows=256 width=0) (actual time=0.515..0.515 rows=0 loops=1)" " -> Bitmap Index Scan on idx_book_rating (cost=0.00..51.95 rows=2604 width=0) (actual time=0.194..0.194 rows=2547 loops=1)" " Index Cond: (rating > 4.9)" " -> Bitmap Index Scan on idx_book_pub_date (cost=0.00..114.15 rows=9848 width=0) (actual time=0.281..0.281 rows=9869 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.074 ms" "Execution Time: 0.881 ms"
-
Sometimes two or more columns with poor selectivity can be combined to form a multicolumn index with good selectivity
SELECT ROUND(( SELECT COUNT(*) AS count_distinct FROM ( SELECT DISTINCT publication_date, rating FROM book ) AS t)::NUMERIC / COUNT(*), 2) AS selectivity FROM book;
selectivity ----------- 1.00
-
Drop the single column indexes
DROP INDEX idx_book_pub_date, idx_book_rating;
-
Create a multicolumn index
CREATE INDEX idx_book_pub_date_rating ON book (publication_date, rating);
-
Execute the query
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01' AND b.rating > 4.9;
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=254.96..1016.03 rows=256 width=57) (actual time=0.658..0.896 rows=253 loops=1)" " Recheck Cond: ((publication_date > '2019-01-01'::date) AND (rating > 4.9))" " Heap Blocks: exact=232" " -> Bitmap Index Scan on idx_book_pub_date_rating (cost=0.00..254.90 rows=256 width=0) (actual time=0.592..0.592 rows=253 loops=1)" " Index Cond: ((publication_date > '2019-01-01'::date) AND (rating > 4.9))" "Planning Time: 0.074 ms" "Execution Time: 0.922 ms"
-
The order of predicates in a query is not important. The planner/optimizer will use the index
idx_book_pub_date_rating
for bothWHERE b.publication_date > $1 AND b.rating > $2
andWHERE b.rating > $2 AND b.publication_date > $1
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.rating > 4.9 AND b.publication_date > '2019-01-01';
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=254.96..1016.03 rows=256 width=57) (actual time=0.732..1.010 rows=253 loops=1)" " Recheck Cond: ((publication_date > '2019-01-01'::date) AND (rating > 4.9))" " Heap Blocks: exact=232" " -> Bitmap Index Scan on idx_book_pub_date_rating (cost=0.00..254.90 rows=256 width=0) (actual time=0.694..0.694 rows=253 loops=1)" " Index Cond: ((publication_date > '2019-01-01'::date) AND (rating > 4.9))" "Planning Time: 0.102 ms" "Execution Time: 1.044 ms"
-
The multicolumn index will be also used in queries referencing only the left part of the indexed columns in the
WHERE
clause, e.g.WHERE b.publication_date > $1
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01';
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=232.74..3022.84 rows=9848 width=57) (actual time=0.592..4.204 rows=9869 loops=1)" " Recheck Cond: (publication_date > '2019-01-01'::date)" " Heap Blocks: exact=1334" " -> Bitmap Index Scan on idx_book_pub_date_rating (cost=0.00..230.28 rows=9848 width=0) (actual time=0.461..0.462 rows=9869 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.054 ms" "Execution Time: 4.449 ms"
-
However, the multicolumn index will not be used in queries referencing only the right part of the indexed columns in the
WHERE
clause, e.g.WHERE b.rating > 4
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.rating > 4;
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..3917.09 rows=24873 width=57) (actual time=0.011..21.136 rows=24998 loops=1)" " Filter: (rating > '4'::numeric)" " Rows Removed by Filter: 75009" "Planning Time: 0.118 ms" "Execution Time: 21.976 ms"
-
The multicolumn index will not be used in queries where predicates are combined with
OR
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.rating > 4.9 OR b.publication_date > '2019-01-01';
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..4167.10 rows=12196 width=57) (actual time=0.752..15.237 rows=12163 loops=1)" " Filter: ((rating > 4.9) OR (publication_date > '2019-01-01'::date))" " Rows Removed by Filter: 87844" "Planning Time: 0.067 ms" "Execution Time: 15.564 ms"
Index Only Scan fetches data directly from the index without reading the table data entirely.
Index Only Scan is the most efficient type of scanning.
To use Index Only Scan the query must select only columns included in the index
EXPLAIN ANALYZE
SELECT b.publication_date,
b.rating
FROM book b
WHERE b.publication_date = '1994-11-10';
QUERY PLAN
----------
"Index Only Scan using idx_book_pub_date_rating on book b (cost=0.42..4.89 rows=27 width=10) (actual time=0.383..0.386 rows=1 loops=1)"
" Index Cond: (publication_date = '1994-11-10'::date)"
" Heap Fetches: 0"
"Planning Time: 0.186 ms"
"Execution Time: 0.472 ms"
A unique index guarantees that the table column values won't have duplicates.
Manually unique index can be defined as
CREATE UNIQUE INDEX book_isbn_key ON book (isbn);
PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table.
The column isbn
of the book
tables has unique constrained (isbn VARCHAR(14) UNIQUE
), so the planner/optimezer will use automatically created unique index book_isbn_key
for the query
EXPLAIN ANALYZE
SELECT b.isbn,
b.title,
b.publication_date,
b.rating
FROM book b
WHERE b.isbn = '978-1449373320';
QUERY PLAN
----------
"Index Scan using book_isbn_key on book b (cost=0.42..8.44 rows=1 width=57) (actual time=0.111..0.111 rows=1 loops=1)"
" Index Cond: ((isbn)::text = '978-1449373320'::text)"
"Planning Time: 0.063 ms"
"Execution Time: 0.124 ms"
A partial index is an index with a WHERE
clause.
Only rows matching the supplied predicated are indexed. Use partial indexes to exclude rows from an index that are not likely to be queried.
-
Create the partial index
CREATE INDEX idx_book_pub_date_rating_part on book (publication_date) WHERE rating > 4;
-
The partial index will be used by queries with predicate
b.rating > 4
likeEXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01' AND b.rating > 4;
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=35.27..2791.45 rows=2449 width=57) (actual time=0.268..1.500 rows=2436 loops=1)" " Recheck Cond: ((publication_date > '2019-01-01'::date) AND (rating > '4'::numeric))" " Heap Blocks: exact=1130" " -> Bitmap Index Scan on idx_book_pub_date_rating_part (cost=0.00..34.66 rows=2449 width=0) (actual time=0.166..0.166 rows=2436 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.128 ms" "Execution Time: 1.574 ms"
-
The partial index will not be used by queries with predicate
b.rating > $1
and$1 != 4
likeEXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01' AND b.rating > 3;
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=115.39..2930.11 rows=4953 width=57) (actual time=0.557..4.687 rows=4979 loops=1)" " Recheck Cond: (publication_date > '2019-01-01'::date)" " Filter: (rating > '3'::numeric)" " Rows Removed by Filter: 4890" " Heap Blocks: exact=1334" " -> Bitmap Index Scan on idx_book_pub_date (cost=0.00..114.15 rows=9848 width=0) (actual time=0.410..0.411 rows=9869 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.106 ms" "Execution Time: 4.857 ms"
Expression indexes are useful for queries using function in the WHERE
clause.
-
Sequential scan is used foth the query that match on
LOWER
function (lowercase)EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE LOWER(b.title) = 'patterns of enterprise application architecture';
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..4167.10 rows=500 width=57) (actual time=0.175..82.824 rows=1 loops=1)" " Filter: (lower((title)::text) = 'patterns of enterprise application architecture'::text)" " Rows Removed by Filter: 100006" "Planning Time: 0.128 ms" "Execution Time: 82.846 ms"
-
Create an expression index
CREATE INDEX idx_book_lower_title on book (LOWER(title));
-
Repeat the
SELECT
queryQUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=20.29..1290.40 rows=500 width=57) (actual time=0.189..0.190 rows=1 loops=1)" " Recheck Cond: (lower((title)::text) = 'patterns of enterprise application architecture'::text)" " Heap Blocks: exact=1" " -> Bitmap Index Scan on idx_book_lower_title (cost=0.00..20.17 rows=500 width=0) (actual time=0.183..0.183 rows=1 loops=1)" " Index Cond: (lower((title)::text) = 'patterns of enterprise application architecture'::text)" "Planning Time: 0.526 ms" "Execution Time: 0.212 ms"
GIN indexes are "inverted indexes". An inverted index contains a separate entry for each component value, and can efficiently handle queries that test for the presence of specific component values.
-
B-Tree index is not used in queries with
LIKE
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.title LIKE 'Patterns%';
QUERY PLAN ---------- "Seq Scan on book b (cost=0.00..3917.09 rows=10 width=57) (actual time=0.017..18.060 rows=1 loops=1)" " Filter: ((title)::text ~~ 'Patterns%'::text)" " Rows Removed by Filter: 100006" "Planning Time: 0.733 ms" "Execution Time: 18.096 ms"
-
To use a trigram GIN indexes, create an extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-
Create the trigtram GIN index
CREATE INDEX idx_book_title_trgm ON book USING gin (title gin_trgm_ops);
-
Repeat the
SELECT
queryQUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=100.08..138.37 rows=10 width=57) (actual time=0.123..0.125 rows=1 loops=1)" " Recheck Cond: ((title)::text ~~ 'Patterns%'::text)" " Rows Removed by Index Recheck: 3" " Heap Blocks: exact=1" " -> Bitmap Index Scan on idx_book_title_trgm (cost=0.00..100.08 rows=10 width=0) (actual time=0.112..0.113 rows=4 loops=1)" " Index Cond: ((title)::text ~~ 'Patterns%'::text)" "Planning Time: 0.509 ms" "Execution Time: 0.167 ms"
-
GIN index can be combined with other indexes using bitmap operations
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.publication_date > '2019-01-01' AND b.title LIKE 'a%';
QUERY PLAN ---------- "Bitmap Heap Scan on book b (cost=184.16..1618.23 rows=597 width=57) (actual time=1.916..3.272 rows=621 loops=1)" " Recheck Cond: (((title)::text ~~ 'a%'::text) AND (publication_date > '2019-01-01'::date))" " Heap Blocks: exact=505" " -> BitmapAnd (cost=184.16..184.16 rows=597 width=0) (actual time=1.772..1.773 rows=0 loops=1)" " -> Bitmap Index Scan on idx_book_title_trgm (cost=0.00..69.46 rows=6061 width=0) (actual time=1.077..1.077 rows=6209 loops=1)" " Index Cond: ((title)::text ~~ 'a%'::text)" " -> Bitmap Index Scan on idx_book_pub_date (cost=0.00..114.15 rows=9848 width=0) (actual time=0.599..0.599 rows=9869 loops=1)" " Index Cond: (publication_date > '2019-01-01'::date)" "Planning Time: 0.182 ms" "Execution Time: 3.334 ms"
Hash index is a flat structure unlike B-Tree.
Hash indexes can only handle simple equality comparisons (using the =
operator)
The main advantage of Hash indexes over B-Tree is space. On a very large data sets Hash indexes takes less space compared to B-Tree and allow pretty fast lookups.
Only B-Tree indexes can be unique, hash indexes doesn't support this feature.
To use Hash indexes PostgreSQL 10+ is required.
-
Drop B-Tree index
idx_book_title
DROP INDEX idx_book_title
-
Create Hash index
CREATE INDEX idx_book_title_hash ON book USING HASH (title);
-
Execute the query
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b WHERE b.title = 'Patterns of Enterprise Application Architecture'
QUERY PLAN ---------- "Index Scan using idx_book_title_hash on book b (cost=0.00..8.02 rows=1 width=57) (actual time=0.029..0.029 rows=1 loops=1)" " Index Cond: ((title)::text = 'Patterns of Enterprise Application Architecture'::text)" "Planning Time: 0.122 ms" "Execution Time: 0.043 ms"
Make sure every foreign key has a matching index.
There are few exceptions when index on foreign key is unnecessary:
- if the table with the foreign key is samll because sequential scan will be probably cheaper,
- if you will never join tables on this key,
- if you will never delete a row or update a key column in the referenced table.
-
Execute the query without indexing the foreign keys
EXPLAIN ANALYZE SELECT b.isbn, b.title, b.publication_date, b.rating FROM book b JOIN publisher p ON p.publisher_id = b.publisher_id WHERE b.publication_date = '1994-11-10';
QUERY PLAN ---------- "Hash Join (cost=5.57..105.83 rows=27 width=57) (actual time=0.753..0.760 rows=1 loops=1)" " Hash Cond: (b.publisher_id = p.publisher_id)" " -> Bitmap Heap Scan on book b (cost=4.50..104.69 rows=27 width=73) (actual time=0.456..0.459 rows=1 loops=1)" " Recheck Cond: (publication_date = '1994-11-10'::date)" " Heap Blocks: exact=1" " -> Bitmap Index Scan on idx_book_pub_date (cost=0.00..4.50 rows=27 width=0) (actual time=0.436..0.437 rows=1 loops=1)" " Index Cond: (publication_date = '1994-11-10'::date)" " -> Hash (cost=1.03..1.03 rows=3 width=16) (actual time=0.124..0.126 rows=3 loops=1)" " Buckets: 1024 Batches: 1 Memory Usage: 9kB" " -> Seq Scan on publisher p (cost=0.00..1.03 rows=3 width=16) (actual time=0.114..0.116 rows=3 loops=1)" "Planning Time: 1.978 ms" "Execution Time: 0.926 ms"
-
Create the index matching the foreign key
CREATE INDEX idx_fk_book_publisher ON book (publisher_id);
-
Repeat the
SELECT
query and compare query plansQUERY PLAN ---------- "Nested Loop (cost=0.29..26.36 rows=27 width=57) (actual time=0.177..0.193 rows=1 loops=1)" " -> Seq Scan on publisher p (cost=0.00..1.03 rows=3 width=16) (actual time=0.008..0.010 rows=3 loops=1)" " -> Index Scan using idx_fk_book_publisher on book b (cost=0.29..8.43 rows=1 width=73) (actual time=0.057..0.058 rows=0 loops=3)" " Index Cond: (publisher_id = p.publisher_id)" " Filter: (publication_date = '1994-11-10'::date)" " Rows Removed by Filter: 2" "Planning Time: 0.720 ms" "Execution Time: 0.226 ms"
-
Drop the index
idx_fk_book_publisher
DROP INDEX idx_fk_book_publisher;
-
Run the following query to find all unindexed foreign key constraints
SELECT c.conname AS fk_constraint_name, c.conrelid::REGCLASS AS table, c.confrelid::REGCLASS AS referenced_table, ARRAY_AGG(a.attname) AS columns, FORMAT('CREATE INDEX ON %s (%s);', c.conrelid::regclass, STRING_AGG(a.attname, ',')) as sql FROM pg_constraint c JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = any(c.conkey) WHERE c.contype = 'f' AND NOT EXISTS ( SELECT 1 FROM pg_index i WHERE i.indrelid = c.conrelid AND conkey <@ STRING_TO_ARRAY(indkey::text, ' ')::SMALLINT[]) GROUP BY c.conname, c.conrelid, c.confrelid;
fk_constraint_name table referenced_table columns sql ------------------------------------------------------------------------- "book_publisher_id_fkey" "book" "publisher" "{publisher_id}" "CREATE INDEX ON book (publisher_id);"
-
Take the values of the
sql
column form the result set and execute to create the missing indexes.
PostgreSQL is good at joining multiple tables. Queries with joins are usually better than subqueries. A query with 5 joins are completely acceptable.
EXPLAIN ANALYZE
SELECT b.isbn,
b.title,
b.publication_date,
a.full_name,
c.name
FROM book b
JOIN book_author ba
ON b.book_id = ba.book_id
JOIN author a
ON a.author_id = ba.author_id
JOIN book_category bc
ON b.book_id = bc.book_id
JOIN category c
ON c.category_id = bc.category_id
WHERE b.title LIKE '%Patterns%';
Maintaining a lot of indexes has its price.
Indexes consume disk space.
The more indexes you have, the slower INSERT
statements will become.
Make sure you don't have an unused indexes.
-
Reset statistics counters to zeros
SELECT pg_stat_reset();
-
Let the system work for some time after resetting statistics.
-
Find indexes that have never been used
SELECT s.schemaname, s.relname AS table_name, s.indexrelname AS index_name, s.idx_scan AS times_used, pg_size_pretty(pg_relation_size(t.relid)) AS table_size, pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size, idx.indexdef AS index_ddl FROM pg_stat_user_indexes s JOIN pg_stat_user_tables t ON s.relname = t.relname JOIN pg_index i ON s.indexrelid = i.indexrelid JOIN pg_indexes AS idx ON s.indexrelname = idx.indexname AND s.schemaname = idx.schemaname WHERE s.idx_scan = 0 -- no scans AND 0 <> ALL(i.indkey) -- 0 in the array means this is an expression index AND NOT i.indisunique -- no unique index ORDER BY pg_relation_size(s.indexrelid) DESC;
The planner/optimizer relies on statistics about the contents of tables in order to generate good plans for queries.
Make sure to run VACUUM ANALYZE
to keep statistics up to date and recover or reuse disk space occupied by updated or deleted rows.
PostgreSQL has autovacuum daemon automating the execution of VACUUM
and ANALYZE
commands.
It is highly recommended to enable autovacuum feature.
Check when the tables was last vacuumed and analyzed
SELECT schemaname,
relname,
last_vacuum,
vacuum_count,
last_analyze,
analyze_count
FROM pg_stat_user_tables;
To detect slow queries pg_stat_statements
extension is required.
It has to be preloaded using shared_preload_libraries=pg_stat_statements
.
-
Check that
pg_stat_statements
was preloadedSHOW shared_preload_libraries;
shared_preload_libraries ------------------------ "pg_stat_statements"
-
Create extension for tracking planning and execution statistics of all executed SQL statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-
Get 20 slowest SQL queries
SELECT SUBSTRING(query, 1, 40) AS short_query, ROUND(( 100 * total_exec_time / SUM(total_exec_time) OVER ())::NUMERIC, 2) AS percent, ROUND(total_exec_time::numeric, 2) AS total_exec_time, calls, ROUND(mean_exec_time::numeric, 2) AS mean, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;
When a table is clustered, it is physically reordered based on the index information.
PostgreSQL Documentation:
In cases where you are accessing single rows randomly within a table, the actual order of the data in the table is unimportant. However, if you tend to access some data more than others, and there is an index that groups them together, you will benefit from using CLUSTER. If you are requesting a range of indexed values from a table, or a single indexed value that has multiple rows that match, CLUSTER will help because once the index identifies the table page for the first row that matches, all other rows that match are probably already on the same table page, and so you save disk accesses and speed up the query.
A table can be clustered according to an index using
CLUSTER book USING book_isbn_key;
Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered.
Partitioning refers to splitting what is logically one large table into smaller physical pieces.
Partitioning benefits:
-
Partitioning helps to scale PostgreSQL by splitting large logical tables into smaller physical tables that can be stored on different storage media based on uses;
-
Dividing a large table into smaller tables reduces table scans and memory swap problems and thus increases performance;
-
Partitioning reduces index size and makes it more likely that the heavily-used parts of the indexes fit in memory.
You will get benefits from partitioning only when a table is very large. Very large means that the table is at least larger than the physical memory of the database server.
To use declarative partition PostgreSQL 10+ is required.
-
Make sure partition pruning is enabled. Partition pruning is a query optimization technique that improves performance for declaratively partitioned tables.
SHOW enable_partition_pruning;
enable_partition_pruning ------------------------ on
-
Drop all tables by executing DDL statements from
drop-all.sql
-
Create a table
book
that will be partitioned bypublication_date
columnCREATE EXTENSION IF NOT EXISTS "uuid-ossp"; CREATE TABLE book ( book_id UUID DEFAULT uuid_generate_v4(), isbn VARCHAR(14), title VARCHAR(255) NOT NULL, publication_date DATE NOT NULL, rating NUMERIC(4, 3), PRIMARY KEY (book_id, publication_date), UNIQUE (isbn, publication_date) ) PARTITION BY RANGE (publication_date);
-
Notice: primary key and unique constraint on partitioned table must include all partitioning columns.
-
Create a tablespace for the partition with the most recent data. Tablespace allows to define locations in the file system where the files representing database objects can be stored. This way partitions can be stored on different storage media.
CREATE TABLESPACE fasttablespace LOCATION '/ssd1/postgresql/data/fasttablespace';
-
Create partitions for the data from the past in the default tablespace
CREATE TABLE book_y1990 PARTITION OF book FOR VALUES FROM ('1990-01-01') TO ('2000-01-01'); CREATE TABLE book_y2000 PARTITION OF book FOR VALUES FROM ('2000-01-01') TO ('2010-01-01'); CREATE TABLE book_y2010 PARTITION OF book FOR VALUES FROM ('2010-01-01') TO ('2020-01-01');
-
Create partition for the most recent data in the
fasttablespace
tablespace located on different storage mediaCREATE TABLE book_y2020 PARTITION OF book FOR VALUES FROM ('2020-01-01') TO ('2030-01-01') TABLESPACE fasttablespace;
-
Create an index on the key column. This automatically creates one index on each partition.
CREATE INDEX idx_book_part_key ON book (publication_date);
-
Generate some test data
INSERT INTO book (isbn, title, publication_date, rating) SELECT SUBSTR(MD5(RANDOM()::TEXT), 0, 14), MD5(RANDOM()::TEXT), DATE '2010-01-01' + CAST(RANDOM() * (DATE '2021-01-01' - DATE '2010-01-01') AS INT), ROUND((1 + RANDOM() * 4)::numeric, 3) FROM generate_series(1, 100000);
-
Depending on a predicate, only a single partition (smaller index) can be queried, improving query performance
EXPLAIN ANALYZE SELECT EXTRACT(YEAR FROM b.publication_date) AS pub_year, COUNT(*) FROM book b WHERE b.publication_date > '2020-01-01' -- Only book_y2020 will be queried GROUP BY pub_year;
QUERY PLAN ---------- "HashAggregate (cost=294.23..297.23 rows=200 width=16) (actual time=4.898..4.900 rows=2 loops=1)" " Group Key: date_part('year'::text, (b.publication_date)::timestamp without time zone)" " Batches: 1 Memory Usage: 40kB" " -> Index Only Scan using book_y2020_publication_date_idx on book_y2020 b (cost=0.29..248.79 rows=9089 width=8) (actual time=0.021..2.606 rows=9092 loops=1)" " Index Cond: (publication_date > '2020-01-01'::date)" " Heap Fetches: 0" "Planning Time: 0.346 ms" "Execution Time: 5.112 ms"
-
Depending on a predicate a, multiple partitions can be queried in parallel and combined into a single result set
EXPLAIN ANALYZE SELECT EXTRACT(YEAR FROM b.publication_date) AS pub_year, COUNT(*) FROM book b WHERE b.publication_date > '2010-01-01' -- Both book_y2010 and book_y2020 will be queried GROUP BY pub_year;
QUERY PLAN ---------- "Finalize GroupAggregate (cost=3455.77..3507.44 rows=200 width=16) (actual time=20.181..21.293 rows=12 loops=1)" " Group Key: (date_part('year'::text, (b.publication_date)::timestamp without time zone))" " -> Gather Merge (cost=3455.77..3502.44 rows=400 width=16) (actual time=20.174..21.282 rows=33 loops=1)" " Workers Planned: 2" " Workers Launched: 2" " -> Sort (cost=2455.75..2456.25 rows=200 width=16) (actual time=17.239..17.241 rows=11 loops=3)" " Sort Key: (date_part('year'::text, (b.publication_date)::timestamp without time zone))" " Sort Method: quicksort Memory: 25kB" " Worker 0: Sort Method: quicksort Memory: 25kB" " Worker 1: Sort Method: quicksort Memory: 25kB" " -> Partial HashAggregate (cost=2445.11..2448.11 rows=200 width=16) (actual time=17.210..17.213 rows=11 loops=3)" " Group Key: (date_part('year'::text, (b.publication_date)::timestamp without time zone))" " Batches: 1 Memory Usage: 40kB" " Worker 0: Batches: 1 Memory Usage: 40kB" " Worker 1: Batches: 1 Memory Usage: 40kB" " -> Parallel Append (cost=0.29..2236.82 rows=41657 width=8) (actual time=0.031..10.669 rows=33327 loops=3)" " -> Parallel Index Only Scan using book_y2010_publication_date_idx on book_y2010 b_1 (cost=0.29..1835.51 rows=53449 width=8) (actual time=0.024..7.661 rows=30290 loops=3)" " Index Cond: (publication_date > '2010-01-01'::date)" " Heap Fetches: 0" " -> Parallel Index Only Scan using book_y2020_publication_date_idx on book_y2020 b_2 (cost=0.29..193.03 rows=5360 width=8) (actual time=0.027..1.176 rows=4556 loops=2)" " Index Cond: (publication_date > '2010-01-01'::date)" " Heap Fetches: 0" "Planning Time: 0.173 ms" "Execution Time: 21.418 ms"