Skip to content

The idea of the repository is to have a streamlined version of the teachings I had per chapter of a curious moon, focused on the Postgres itself.

Notifications You must be signed in to change notification settings

CarlosPena00/tldr_a_curious_moon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TL;DR Of A Curious Moon

The idea of the repository is to have a streamlined version of the teachings I had per chapter1, focused on the Postgres itself. I found it interesting to change the database to avoid a simple copy/paste by myself. I chose a public dataset from the Brazilian government regarding confiscated cigarettes (and destroyed)2, that I may (or may not) change in the future.

A Curious Moon by Rob Conery

Before the book

In this repository we will use docker-compose to run a local postgres instance, to start run:

docker compose up --remove-orphans [-d]
# or
sh scripts/00_start.sh

To use the psql CLI3:

sh scripts/10_psql.sh
# or directly commands
sh scripts/10_psql.sh <<< 'select 1;'
sh scripts/10_psql.sh < my_sql.sql

Transit

  • Respect the ETL (extract, transform, load)
  • Before a create must have a drop * if exists [cascade]
  • Create a schema just for importing the data with text type
  • Put everything in an idempotent script (same input same output)

In Orbit

  • Create lookup tables: reduce repetitions and could speedup search. A new table with all distinct values from the given column (of the import.* table), with a primary key (that will be used as foreign key)
  • Date in Postgres is stored as a UTC
  • TIMESTAMPT: values in UTC
  • TIMESTAMPTZ: converts TIMESTAMP values (UTC) to the client's session time zone
  • make clean && make

Flyby

  • A query is Sargable(Search ARGument ABLE) if it can take advantage of an index to speed up the execution of the query
  • Avoid LIKE %* as it is non-sargable
  • ILIKE is case-insensitive

A Bent Field

  • Create a view to make querying easier (and drop if exists)
  • \x: toggle expanded display (table vs record by record)
  • \H: change output to html layout
  • \o file_name: change output to a file
  • Full-text indexing: to_tsvector(my_text_column) and search @@ to_tsquery('my_search_query')
  • GIN: Generalized Inverted Index (key, posting list)
  • Materialized View: a view does not store data, a Materialized View is stored in disk (and so, allowed to create an index). Once the original data is updated, it is necessary to update the materialized view: REFRESH MATERIALIZED VIEW my_view_name

Sniff The Sky

  • Don't burn yourself out
  • Count how many files follow the pattern: ls */*.txt | wc -l
  • csvkit: command-line tools for converting to and working with CSV, such as:
    • in2csv: convert * to CSV
    • csvcut -n: print column names
    • csvsql: import into PostgreSQL (remember to replace varchar to text)

csvsql my_csv_path -i postgresql --tables "import.my_name" --noconstraints –overwrite | sed 's/VARCHAR/text/g' > import.sql

  • Using cut command to retrieve the desired columns: cut -d ';' -f 1-2,4-6 user_data/cigarros.csv
  • CTE (Common Table Expression): with temp_table_name as ( select ...), other_name as (select ... where temp_table_name.x = 2) select * from other_name
with count_events as (select count(*) from events),
     count_regions as (select count(*) from regions)
select * from count_events, count_regions;

Ring Dust

  • PostgreSQL support the following languages: PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python.
drop function if exists my_function(arguments_types);
create function my_function (arguments)
returns return_datatype
IMMUTABLE -- if pure function (no side effects, same input same output)
as $$
   declare
      ...
   begin
      ...
      return my_return_value
   end;
$$ language sql;

Or similar to the one presented in the book

drop function if exists max_price_per_region(int);
create function max_price_per_region(rid int, out numeric)
as $$
    select max(price) from events where events.region_id=rid
    -- implicity return the first row of this select
$$ language sql;

To other languages, such as python, it is necessary to run CREATE LANGUAGE [plpython3u] that requires installing python itself along with Postgres.

  • Window function: similar to aggregation without grouping rows count(1) over (partition by ...) (you may want to add distinct in after the select)

as example, the percentage of entries per year with max/min price per year.

select distinct
    year,
    round(100.0 * (count(1) over (partition by year)::numeric
            / (count(1) over())::numeric), 2) as percentage_of_entries,
    (max(price) over (partition by year)::numeric
            / (count(1) over())::numeric) as max_price,
    (min(price) over (partition by year)::numeric
            / (count(1) over())::numeric) as min_price
from events;

A Tight Ship

  • Chapter dedicated to explaining 'why to choose postgres'
  • PostgreSQL License: liberal Open Source license, similar to the BSD or MIT licenses. You can distribute, modify and commercial use.
  • Price per hour comparison on AWS-RDS with db.t3.xlarge (16GB, 4vCPU): $0.60 oracle vs $0.29 Postgres
  • MVCC (Multi-Version Concurrency Control): locks acquired for querying (reading) data don't conflict with locks acquired for writing data. (reading doesn't blocks writing, and writing doesn't blocks reading).
  • For writing: the data is lifted into virtual memory, only when the transaction is complete is the entire change written to disk in one operation.

Gravity Assist

  • You (unfortunately) may add bias to the data (but avoid it)
  • Occam's razor: the simplest explanation is usually the best one
  • The Temporary table is automatically dropped at the end of a session / current transaction CREATE TEMPORARY TABLE my_temp_table_name

Under The Ice

  • Use explain [analyze] my_query to shows the execution plan of a query, it is important to check its cost and its scans
  • The cost is a relative measurement, which you may want to minimize.
  • There are several ways to scan a table such as sequential scan, index scan, ...
  • \d my_element describes a element (table, view, index, ...)
  • In a many-to-many (N:M) relationship, a junction table is used to support relationship between two tables. The junction table contains the two foreign keys and creates its own primary key using those two foreign keys.
  • In postgres is possible to index with where (e.g.: index where a column is not null), and also concurrently (without paying)
  • BTREE index (balanced tree)
  • Also the tsrange (timestamp range) in which you define a window (start, end) and inclusive/exclusive options.
  • Other range options: int4range, numrange, daterange ...
  • Ranges Types: Your Life Will Never Be The Same
  • To query ranges, use @> e.g: SELECT int4range(10, 30) @> 20, to compute the intersection between two ranges use *, and to check if the ranges overlaps, use &&
  • Remember to create a new index using GIST (Generalized Search Tree) on the ranges (if you wish)
  • However, if you are using a range to constrain results it may not help (as Dee problem)

Glide Path and Epilog

  • Time for plot
  • Overall, I loved this book! I never thought a tutorial could be so well done
  • Please, feel free to contribute to more lessons :D

Footnotes

  1. Some notes here were not directly found in the book, but the book led me to research about.

  2. Data From: https://dados.gov.br/dados/conjuntos-dados/destinacoes-de-mercadorias-apreendidas

  3. I would like to thank to Giovanni for the psql script.

About

The idea of the repository is to have a streamlined version of the teachings I had per chapter of a curious moon, focused on the Postgres itself.

Topics

Resources

Stars

Watchers

Forks