TL;DR Of A Curious Moon

The idea of the repository is to have a streamlined version of the teachings I had per chapter¹, focused on the Postgres itself. I found it interesting to change the database to avoid a simple copy/paste by myself. I chose a public dataset from the Brazilian government regarding confiscated cigarettes (and destroyed)², that I may (or may not) change in the future.

A Curious Moon by Rob Conery

Before the book

In this repository we will use docker-compose to run a local postgres instance, to start run:

docker compose up --remove-orphans [-d]
# or
sh scripts/00_start.sh

To use the psql CLI³:

sh scripts/10_psql.sh
# or directly commands
sh scripts/10_psql.sh <<< 'select 1;'
sh scripts/10_psql.sh < my_sql.sql

Transit

Respect the ETL (extract, transform, load)
Before a create must have a drop * if exists [cascade]
Create a schema just for importing the data with text type
Put everything in an idempotent script (same input same output)

In Orbit

Create lookup tables: reduce repetitions and could speedup search. A new table with all distinct values from the given column (of the import.* table), with a primary key (that will be used as foreign key)
Date in Postgres is stored as a UTC
TIMESTAMPT: values in UTC
TIMESTAMPTZ: converts TIMESTAMP values (UTC) to the client's session time zone
make clean && make

Flyby

A query is Sargable(Search ARGument ABLE) if it can take advantage of an index to speed up the execution of the query
Avoid LIKE %* as it is non-sargable
ILIKE is case-insensitive

A Bent Field

Create a view to make querying easier (and drop if exists)
\x: toggle expanded display (table vs record by record)
\H: change output to html layout
\o file_name: change output to a file
Full-text indexing: to_tsvector(my_text_column) and search @@ to_tsquery('my_search_query')
GIN: Generalized Inverted Index (key, posting list)
Materialized View: a view does not store data, a Materialized View is stored in disk (and so, allowed to create an index). Once the original data is updated, it is necessary to update the materialized view: REFRESH MATERIALIZED VIEW my_view_name

Sniff The Sky

Don't burn yourself out
Count how many files follow the pattern: ls */*.txt | wc -l
csvkit: command-line tools for converting to and working with CSV, such as:
- in2csv: convert * to CSV
- csvcut -n: print column names
- csvsql: import into PostgreSQL (remember to replace varchar to text)

csvsql my_csv_path -i postgresql --tables "import.my_name" --noconstraints –overwrite | sed 's/VARCHAR/text/g' > import.sql

Using cut command to retrieve the desired columns: cut -d ';' -f 1-2,4-6 user_data/cigarros.csv
CTE (Common Table Expression): with temp_table_name as ( select ...), other_name as (select ... where temp_table_name.x = 2) select * from other_name

with count_events as (select count(*) from events),
     count_regions as (select count(*) from regions)
select * from count_events, count_regions;

Ring Dust

PostgreSQL support the following languages: PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python.

drop function if exists my_function(arguments_types);
create function my_function (arguments)
returns return_datatype
IMMUTABLE -- if pure function (no side effects, same input same output)
as $$
   declare
      ...
   begin
      ...
      return my_return_value
   end;
$$ language sql;

Or similar to the one presented in the book

drop function if exists max_price_per_region(int);
create function max_price_per_region(rid int, out numeric)
as $$
    select max(price) from events where events.region_id=rid
    -- implicity return the first row of this select
$$ language sql;

To other languages, such as python, it is necessary to run CREATE LANGUAGE [plpython3u] that requires installing python itself along with Postgres.

Window function: similar to aggregation without grouping rows count(1) over (partition by ...) (you may want to add distinct in after the select)

as example, the percentage of entries per year with max/min price per year.

select distinct
    year,
    round(100.0 * (count(1) over (partition by year)::numeric
            / (count(1) over())::numeric), 2) as percentage_of_entries,
    (max(price) over (partition by year)::numeric
            / (count(1) over())::numeric) as max_price,
    (min(price) over (partition by year)::numeric
            / (count(1) over())::numeric) as min_price
from events;

A Tight Ship

Chapter dedicated to explaining 'why to choose postgres'
PostgreSQL License: liberal Open Source license, similar to the BSD or MIT licenses. You can distribute, modify and commercial use.
Price per hour comparison on AWS-RDS with db.t3.xlarge (16GB, 4vCPU): $0.60 oracle vs $0.29 Postgres
MVCC (Multi-Version Concurrency Control): locks acquired for querying (reading) data don't conflict with locks acquired for writing data. (reading doesn't blocks writing, and writing doesn't blocks reading).
For writing: the data is lifted into virtual memory, only when the transaction is complete is the entire change written to disk in one operation.

Gravity Assist

You (unfortunately) may add bias to the data (but avoid it)
Occam's razor: the simplest explanation is usually the best one
The Temporary table is automatically dropped at the end of a session / current transaction CREATE TEMPORARY TABLE my_temp_table_name

Under The Ice

Use explain [analyze] my_query to shows the execution plan of a query, it is important to check its cost and its scans
The cost is a relative measurement, which you may want to minimize.
There are several ways to scan a table such as sequential scan, index scan, ...
\d my_element describes a element (table, view, index, ...)
In a many-to-many (N:M) relationship, a junction table is used to support relationship between two tables. The junction table contains the two foreign keys and creates its own primary key using those two foreign keys.
In postgres is possible to index with where (e.g.: index where a column is not null), and also concurrently (without paying)
BTREE index (balanced tree)
Also the tsrange (timestamp range) in which you define a window (start, end) and inclusive/exclusive options.
Other range options: int4range, numrange, daterange ...
Ranges Types: Your Life Will Never Be The Same
To query ranges, use @> e.g: SELECT int4range(10, 30) @> 20, to compute the intersection between two ranges use *, and to check if the ranges overlaps, use &&
Remember to create a new index using GIST (Generalized Search Tree) on the ranges (if you wish)
However, if you are using a range to constrain results it may not help (as Dee problem)

Glide Path and Epilog

Time for plot
Overall, I loved this book! I never thought a tutorial could be so well done
Please, feel free to contribute to more lessons :D

Some notes here were not directly found in the book, but the book led me to research about. ↩
Data From: https://dados.gov.br/dados/conjuntos-dados/destinacoes-de-mercadorias-apreendidas ↩
I would like to thank to Giovanni for the psql script. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
scripts		scripts
sql		sql
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TL;DR Of A Curious Moon

Before the book

Transit

In Orbit

Flyby

A Bent Field

Sniff The Sky

Ring Dust

A Tight Ship

Gravity Assist

Under The Ice

Glide Path and Epilog

About

Languages

CarlosPena00/tldr_a_curious_moon

Folders and files

Latest commit

History

Repository files navigation

TL;DR Of A Curious Moon

Before the book

Transit

In Orbit

Flyby

A Bent Field

Sniff The Sky

Ring Dust

A Tight Ship

Gravity Assist

Under The Ice

Glide Path and Epilog

Footnotes

About

Topics

Resources

Stars

Watchers

Forks

Languages