Transaction: Indivisible unit of work that either succeeds completely or fails completely.
✨ Side learning: Transactions are not locks
ACID properties
- Atomicity: All or nothing. Either all the operations in a transaction succeed, or none of them do.
- Consistency: The database is always in a valid state. This is the most weakly defined property, because it's the only property that is controlled by the user (the application) and not by the database itself.
- Isolation: Multiple transactions that are running concurrently should not interfere with each other (as if no other transaction is running at the same time).
- Durability: Once a transaction has completed successfully, the changes it has made to the database are permanent. This allows the database to recover from failures.
A bunch of different components come together to implement transactions. These are:
- Transaction manager: Coordinates, schedules, and tracks transactions.
- Lock manager: Prevents concurrent access to data. If the lock is exclusive, then only one transaction can access the data at a time, and other transactions have to wait for the lock to be released. If the lock is shared, then multiple transactions can read the data at the same time.
- Page cache: Intermediary between the database's storage engine and the persistent storage (disk). It manages caching of pages in memory.
- Log manager: Writes all the changes to the database to a log file. This is used for recovery in case of a crash.
The goal is to reduce the number of disk accesses, because disk access is slow. For this, pages are cached in memory.
Virtual Disk: A read from virtual disk accesses physical storage (disk) only if the page is not in the page cache. If the page is in the buffer pool, then it is read from there.
Virtual Disk / Page Cache / Buffer Pool: All these terms are used interchangeably. We will use the term page cache for the rest of the notes.
Some other phrases that are widely used:
- "Paged in": When a page is read from the disk (because it is not in cache), it is said to be paged in.
- Dirty page: A page that has been modified in memory but not yet written to disk.
- "flushed to disk": When a page is written to disk, it is said to be flushed to disk.
- "evicted": When a page is removed from the page cache, it is said to be evicted.
When storage engine requests the page:
- If the page is in the page cache, then it is returned.
- If the page is not in the page cache, then it is read from the disk and returned.
Until the storage engine hands the page back to the page cache, the page is said to be "referenced". Once the page is handed back to the page cache, it is said to be "unreferenced".
The page cache can be told to avoid evicting a page by "pinning" it.
Page cache has a limited capacity. When the page cache is full, it needs to evict some pages to make space for new pages.
- Dirty pages need to be flushed to disk before they can be evicted.
- Referenced pages cannot be evicted while some thread is still using them.
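To make the above concrete, here is a minimal sketch of a page cache with pinning, dirty tracking, and flush-before-evict. The `Page`/`PageCache` classes and the `read_fn`/`flush_fn` callbacks are invented for illustration; this is not how any particular database structures its buffer pool.

```python
from collections import OrderedDict

class Page:
    def __init__(self, page_id, data=b""):
        self.page_id = page_id
        self.data = data
        self.dirty = False      # modified in memory, not yet flushed to disk
        self.pin_count = 0      # pinned/referenced pages cannot be evicted

class PageCache:
    def __init__(self, capacity, read_fn, flush_fn):
        self.capacity = capacity
        self.read_fn = read_fn      # reads a page's bytes from disk ("paged in")
        self.flush_fn = flush_fn    # writes a page back to disk ("flushed")
        self.pages = OrderedDict()  # page_id -> Page, kept in LRU order

    def get(self, page_id):
        if page_id in self.pages:                 # cache hit
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:      # cache full: make room first
            self._evict_one()
        page = Page(page_id, self.read_fn(page_id))   # paged in from disk
        self.pages[page_id] = page
        return page

    def _evict_one(self):
        victim = None
        for page in self.pages.values():          # least recently used first
            if page.pin_count == 0:               # pinned/referenced pages are skipped
                victim = page
                break
        if victim is None:
            raise RuntimeError("every page is pinned; nothing can be evicted")
        if victim.dirty:
            self.flush_fn(victim)                 # dirty pages are flushed before eviction
            victim.dirty = False
        del self.pages[victim.page_id]            # page is evicted
```

Here `get()` stands in for the storage engine requesting a page; real buffer pools also track per-request references and use smarter replacement policies, which are discussed below.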
Postgres has a background flush writer that cycles over the dirty pages that are likely to be evicted, and flushes them to disk. This is because triggering a flush on every eviction might be bad for performance.
Durability: If the database crashes, all the dirty pages that have not been flushed to disk are lost. To ensure durability, the checkpoint process coordinates the page cache and the write-ahead log (WAL).
Which log records can be discarded from WAL? Only the ones which correspond to operations applied to cached pages that have been flushed to disk. This is because the changes are already on disk, so they can be recovered from there in the case of a crash.
Pinning: Locking pages in the cache so that they cannot be evicted.
Which pages would we want to pin? The ones that we know we will need to access again soon.
In the context of B-Trees, we would want to pin the pages at the higher levels of the tree, because they are more likely to be accessed again soon. The higher levels have exponentially fewer nodes than the lower levels, so it is feasible to cache them. In the case of splits or merges, changes might propagate to the higher levels, so having them in cache is useful and helps reduce the number of disk accesses.
Another good aspect here is that multiple writes to the same page can be batched together, and the page can be flushed to disk only once. For example, if 10 splits or merges propagate to the root node, then all of them can be batched together and the root page can be flushed to disk only once.
Prefetching and Immediate Eviction
Prefetching: In some cases, when a page is read from disk, we can already tell which pages would be likely to be accessed soon. Example, we can load the next leaf nodes during a range scan.
Immediate Eviction: In some cases, we can evict a page from the cache as soon as it has been read from disk. Example, if a maintenance process loads the page, it's unlikely that the page would be useful for any future queries, so it can be evicted immediately.
Page replacement: When the page cache is full, and a new page needs to be loaded, then some existing page needs to be evicted to make space for the new page.
Page replacement policy / Eviction policy: The algorithm that decides which page to evict.
In an ideal world, we would want to evict the pages that would not be touched for the longest time in the future. But, this is not possible to predict.
Belady's Anomaly: A larger cache does not always mean fewer evictions. Belady's anomaly shows that, for some eviction policies (FIFO being the classic case), increasing the number of cache pages can actually increase the number of evictions.
We discuss some eviction policies below.
FIFO == First In First Out
LRU == Least Recently Used
FIFO eviction policy: Evict the page that was loaded first.
LRU eviction policy: Evict the page that was accessed least recently.
LRU-K eviction policy: Like LRU, but it tracks the last K accesses to each page and evicts the page whose K-th most recent access lies furthest in the past.
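To make Belady's anomaly (mentioned above) concrete, the sketch below simulates FIFO eviction on the classic reference string; both the trace and the frame counts are the standard textbook example, not something specific to this chapter.

```python
from collections import deque

def fifo_misses(trace, capacity):
    """Count cache misses under a FIFO eviction policy."""
    cache, queue, misses = set(), deque(), 0
    for page in trace:
        if page in cache:
            continue                        # hit
        misses += 1                         # miss: page must be loaded
        if len(cache) >= capacity:
            cache.discard(queue.popleft())  # evict the oldest loaded page
        cache.add(page)
        queue.append(page)
    return misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_misses(trace, 3))  # 9 misses with 3 frames
print(fifo_misses(trace, 4))  # 10 misses with 4 frames: the bigger cache does worse
```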
Clock sweep algorithm holds references to pages and an "access bit" in a circular buffer.
The clock hand moves around the clock, and the page that it points to is a candidate for eviction.
- If access bit is 1, and the page is unreferenced, then the access bit is set to 0 and the clock hand moves to the next page.
- If access bit is 0, then the page becomes a candidate and is scheduled for eviction.
- If the page is currently referenced, don't change the access bit.
Access Bit | Is Referenced | Action |
---|---|---|
1 | Unreferenced | Set access bit to 0 and move the clock hand to the next page |
0 | - | Page becomes a candidate for eviction and is scheduled for eviction |
- | Referenced | Do not change the access bit |
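A minimal sketch of the clock-sweep rules in the table above. The `ClockEntry` structure and the way a victim is returned are simplifications for illustration; the function also assumes at least one unreferenced page exists, otherwise the loop would spin forever.

```python
class ClockEntry:
    def __init__(self, page_id):
        self.page_id = page_id
        self.access_bit = 1      # set to 1 whenever the page is accessed
        self.referenced = False  # True while some thread is still using the page

def clock_sweep(entries, hand):
    """Advance the clock hand until an eviction candidate is found.
    Returns (victim_index, next_hand_position)."""
    while True:
        entry = entries[hand]
        if entry.referenced:
            pass                          # referenced: leave the access bit unchanged
        elif entry.access_bit == 1:
            entry.access_bit = 0          # unreferenced, bit set: give a second chance
        else:
            return hand, (hand + 1) % len(entries)  # unreferenced, bit clear: evict
        hand = (hand + 1) % len(entries)
```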
LFU == Least Frequently Used
TinyLFU is a frequency-based page eviction policy that builds on this idea.
Write-Ahead-Log (WAL) (also known as commit log) == Append-only, auxiliary, disk-resident structure used for crash recovery.
WAL functionalities:
- Allows page cache to buffer updates to disk-resident pages.
- Persists all operations on disk until the cached copies of the affected pages are flushed to disk.
- Allows lost in-memory changes to be reconstructed from the log in the case of failures.
With the above functionalities, the WAL plays an important role in transaction processing. In the case of a crash, changes that had not yet been flushed are replayed from the log and the pre-crash state is restored.
WAL consists of a sequence of log records. Each log record contains a unique, monotonically increasing LSN (log-sequence-number) that identifies the log record.
Contents of log records are cached in the log buffer, and are flushed to disk in a force operation. Flushing happens in LSN order.
WAL also holds records that indicate the start and end of a transaction. A transaction cannot be considered committed until the log is forced to the LSN of the commit record.
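A toy write-ahead log to illustrate LSNs, the log buffer, the force operation, and commit records. The `WAL` class and its record format are invented for this sketch and do not correspond to any real database's WAL layout.

```python
import os

class WAL:
    def __init__(self, log_path):
        self.log_path = log_path
        self.next_lsn = 1
        self.buffer = []          # log records cached in the log buffer
        self.flushed_lsn = 0      # highest LSN known to be durable on disk

    def append(self, record):
        """Append a record and return its monotonically increasing LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffer.append((lsn, record))
        return lsn

    def force(self, up_to_lsn):
        """Flush buffered records to disk, in LSN order, up to up_to_lsn."""
        remaining = []
        with open(self.log_path, "a") as f:
            for lsn, record in self.buffer:
                if lsn <= up_to_lsn:
                    f.write(f"{lsn}\t{record}\n")
                else:
                    remaining.append((lsn, record))
            f.flush()
            os.fsync(f.fileno())          # the written records are now durable
        self.buffer = remaining
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def commit(self, txn_id):
        """A transaction is committed only once its commit record is forced."""
        commit_lsn = self.append(f"COMMIT {txn_id}")
        self.force(commit_lsn)
        return commit_lsn
```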
Compensation Log Records (CLRs): To be able to recover from crashes during rollback or recovery, we use CLR. Good link explaining CLRs: https://dba.stackexchange.com/a/299446
Checkpoints
Checkpoints are used to reduce the amount of work that needs to be done during recovery. Sync checkpoint is a process to force all dirty pages to be flushed to disk.
But flushing the entire contents of the page cache to disk at once is impractical. So, we use fuzzy checkpoints. A fuzzy checkpoint begins with a `begin_checkpoint` record, ends with an `end_checkpoint` record, and contains information about the dirty pages and the in-progress transactions. Once the pages are flushed (asynchronously), the `last_checkpoint` pointer is updated to point to the `begin_checkpoint` record.
Shadow paging: A copy-on-write technique that ensures transaction atomicity by placing new contents in a new, unpublished "shadow" page, which is then made visible by flipping a pointer to point to the new page.
We can use 2 types of logs:
- Physical log: Stores complete page state.
- Logical log: Stores only the operations.
Practically, databases use a combination of both:
- Physical logging to perform a redo (since that improves recovery time)
- Logical logging to perform an undo (since that improves concurrency and performance)
Databases define steal/no-steal and force/no-force policies to determine "when" the in-memory changes are flushed to disk.
Steal Policy:
- Steal policy: Allows dirty pages to be flushed to disk before the transaction commits.
- No-steal policy: Dirty pages are not allowed to be flushed to disk before the transaction commits.
- "Stealing" a dirty page means flushing it to disk and replacing it with a different page.
Force Policy:
- Force policy: Requires dirty pages to be flushed to disk "IMMEDIATELY BEFORE" the transaction commits.
- No-force policy: Allows transaction to commit even if the dirty pages have not been flushed to disk.
- "Forcing" a dirty page means flushing it to disk before the commit.
I found the explanation in the book quite confusing. I found a much better explanation here that cleared things for me: https://stackoverflow.com/a/37861999
Why are we discussing these policies in the context of recovery?
- Using no-steal policy allows implementing recovery using only redo entries.
- Using no-force policy allows us to buffer several updates to pages in the page cache, and flush them to disk in a single batch. This improves performance.
General rule of thumb:
- If any pages touched by a transaction are flushed, we need to keep "undo" information in the log until it commits, so that we can use that to roll the transaction back.
- If any pages touched by a transaction are not flushed, we need to keep "redo" information in the log until it commits, so that we can use that to reach the pre-crash state.
ARIES == Algorithm for Recovery and Isolation Exploiting Semantics
It is a steal/no-force recovery algorithm.
It uses:
- physical redo to improve performance during recovery.
- logical undo to improve performance during normal operation (since they can be applied to pages independently).
- WAL records to implement repeating history during recovery.
When database restarts after the crash, recovery happens in 3 phases:
- Analysis Phase: Identify the dirty pages and the transactions that were active at the time of the crash.
- Redo Phase: Repeats the history and restores the database to the state it was in at the time of the crash. This phase is done for both committed and uncommitted transactions whose contents were not flushed to disk.
- Undo Phase: Rolls back the uncommitted transactions in reverse chronological order.
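A very reduced sketch of the three phases, assuming the log is just an in-memory list of records with made-up fields (`lsn`, `txn`, `op`, `commit`). Real ARIES also builds a dirty page table during analysis, tracks per-page LSNs to avoid re-applying changes, and writes CLRs during undo.

```python
def aries_recover(log, apply_redo, apply_undo):
    """log: list of records like {"lsn": 3, "txn": "T1", "op": ..., "commit": False}.
    apply_redo / apply_undo: callbacks that reapply or roll back one operation."""
    # Analysis: transactions without a commit record are the "losers".
    committed = {rec["txn"] for rec in log if rec.get("commit")}
    losers = {rec["txn"] for rec in log} - committed

    # Redo: repeat history for *all* transactions (committed or not)
    # to rebuild the exact state at the time of the crash.
    for rec in sorted(log, key=lambda r: r["lsn"]):
        if rec.get("op") is not None:
            apply_redo(rec["op"])

    # Undo: roll back the loser transactions in reverse LSN order.
    for rec in sorted(log, key=lambda r: r["lsn"], reverse=True):
        if rec["txn"] in losers and rec.get("op") is not None:
            apply_undo(rec["op"])
```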
3 types of techniques to handle concurrent transactions:
- Optimistic concurrency control (OCC): Transactions do not block each other. If execution results in conflict, one of the transactions is aborted.
- Pessimistic concurrency control (PCC): Prevents conflicts by locking data.
- Multi-version concurrency control (MVCC): Allows multiple versions of the same data to exist at the same time.
✨ Side learning:
I found the above explanation from the book quite confusing. I found a much better explanation here that cleared things for me: https://stackoverflow.com/a/39269085
TLDR of the above link:
- Optimistic Concurrency Control: This approach assumes that conflicting operations do not happen frequently (that is, it's optimistic!). This is why it allows transactions to proceed without locking the data. If a conflict happens, then one of the transactions is aborted.
- Pessimistic Concurrency Control: This approach assumes that conflicting operations happen frequently (that's why it's called pessimistic). This is why it locks the data before allowing transactions to proceed.
- Multi-version concurrency control (MVCC): This is a concurrency control method that typically belongs in the category of optimistic concurrency control. It is theoretically possible to combine it with pessimistic methods (like locking), but its multi-version nature is best combined with optimistic methods.
Schedule == List of operations required to execute a set of transactions from database's perspective.
"Complete" schedule: A schedule that contains all the operations of all the transactions.
"Correct" schedule: If the schedule is logically equivalent to the original lists of operations, but their parts can be executed in parallel or reordered.
"Serial" schedule: A schedule in which the operations of each transaction are executed consecutively without interleaving. This significantly reduces the concurrency, hence reducing throughput.
Serializability: A schedule is serializable if it is equivalent to "some" serial schedule.
For example, if we have 3 transactions T1, T2, and T3, then the following are the possible serial schedules:
- T1, T2, T3
- T1, T3, T2
- T2, T1, T3
- T2, T3, T1
- T3, T1, T2
- T3, T2, T1
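These are simply the 3! permutations of the three transactions, which can be enumerated directly:

```python
from itertools import permutations

for schedule in permutations(["T1", "T2", "T3"]):
    print(", ".join(schedule))   # prints the 3! = 6 serial schedules listed above
```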
"Isolation level" defines the degree to which the operations of a transaction are visible to other transactions.
3 types of read anomalies:
- Dirty read: A transaction reads a value that has not yet been committed.
- Non-repeatable read: A transaction reads a value, and then reads the same value again, but gets a different value the second time.
- Phantom read: A transaction reads a set of rows that satisfy a search condition, and then repeats the search, but the set of rows that satisfy the search condition has changed in the meantime. It is similar to non-repeatable read, but it's for range queries.
3 types of write anomalies:
- Lost update: A transaction overwrites the changes made by another transaction. Hence, the updates made by the other transaction are lost.
- Dirty write: A transaction reads an uncommitted value (that is, a dirty read), modifies it, and writes it back. Hence, transaction results are based on uncommitted data.
- Write skew: When each individual transaction respects the invariant, but the invariant is violated when the transactions are considered together.
Isolation Level | Dirty | Non-Repeatable | Phantom |
---|---|---|---|
Read Uncommitted | ✅ | ✅ | ✅ |
Read Committed | ❌ | ✅ | ✅ |
Repeatable Read | ❌ | ❌ | ✅ |
Serializable | ❌ | ❌ | ❌ |
✅ means that the anomaly is allowed for that isolation level.
❌ means that the anomaly is not allowed for that isolation level.
Snapshot isolation allows transactions to read data as it existed at the start of the transaction. Each transaction takes a snapshot of the database at the start of the transaction, and reads from that snapshot for the rest of the transaction. The transaction commits only if the values it has modified have not been modified by any other transaction. If the values have been modified, then the transaction is aborted and rolled back.
This makes "lost update" anomaly impossible, because the transaction is aborted if the values it has modified have been modified by another transaction.
However, a "write skew" anomaly is still possible.
Optimistic concurrency control (OCC) assumes that conflicting operations do not happen frequently.
3 phases of transaction execution:
- Read Phase: Transaction reads the data it needs. After this step, all transaction dependencies (read set) and the output (write set) are known.
- Validation Phase: Determines whether committing the transaction preserves ACID properties.
- Write Phase: If validation phase has been successful, transaction commits the "write set" to the database.
"Validation" is done by checking for conflicts with other transactions.
- Backward oriented: Checks conflicts with transactions that have already been committed.
- Forward oriented: Checks conflicts with transactions that are in the validation phase.
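A sketch of backward-oriented validation under the OCC phases above: each transaction tracks a read set and a write set, and at validation time its read set is checked against the write sets of transactions that committed after it started. The `Transaction` class, the commit clock, and the conflict rule are simplified for illustration.

```python
class Transaction:
    def __init__(self, start_ts):
        self.start_ts = start_ts
        self.read_set = set()     # keys read during the read phase
        self.write_set = {}       # key -> new value, produced during the read phase

committed = []      # (commit_ts, keys_written) for each committed transaction
commit_clock = 0    # logical commit timestamp

def validate_and_commit(txn, database):
    """Backward validation: abort if something we read was overwritten by a
    transaction that committed after we started; otherwise apply the write set."""
    global commit_clock
    for commit_ts, written_keys in committed:
        if commit_ts > txn.start_ts and txn.read_set & written_keys:
            return False                     # conflict detected: abort
    commit_clock += 1                        # write phase
    database.update(txn.write_set)
    committed.append((commit_clock, set(txn.write_set)))
    return True

# Usage: a new transaction starts with start_ts = current commit_clock, reads into
# read_set, buffers writes into write_set, then calls validate_and_commit(txn, db)
# where db is a plain dict.
```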
MVCC distinguishes between committed and uncommitted versions of the data. It allows multiple versions of the same data to exist at the same time.
Major use case for MVCC is snapshot isolation.
Pessimistic concurrency control assumes that conflicting operations happen frequently, and hence is more conservative than optimistic concurrency control.
Timestamp ordering
This is one of the simplest pessimistic lock-free concurrency control schemes.
The transaction manager maintains a `max_read_timestamp` and a `max_write_timestamp` for each value.
When a transaction reads a value, `max_read_timestamp` is updated. When a transaction writes a value, `max_write_timestamp` is updated.
- Read operations that try to read a value with a timestamp less than `max_write_timestamp` are aborted, since the value has already been modified by a transaction that started after the current transaction.
- Write operations that try to write a value with a timestamp less than `max_read_timestamp` are aborted, since the value has already been read by a transaction that started after the current transaction.
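A sketch of the timestamp-ordering checks described above, keeping `max_read_timestamp`/`max_write_timestamp` per value in a dictionary; the `Abort` exception and `ValueMeta` record are made up for the example.

```python
class Abort(Exception):
    """Raised when a transaction must be aborted and restarted."""

class ValueMeta:
    def __init__(self):
        self.value = None
        self.max_read_timestamp = 0
        self.max_write_timestamp = 0

store = {}   # key -> ValueMeta

def read(key, txn_ts):
    meta = store.setdefault(key, ValueMeta())
    if txn_ts < meta.max_write_timestamp:
        raise Abort("value was already written by a younger transaction")
    meta.max_read_timestamp = max(meta.max_read_timestamp, txn_ts)
    return meta.value

def write(key, value, txn_ts):
    meta = store.setdefault(key, ValueMeta())
    if txn_ts < meta.max_read_timestamp:
        raise Abort("value was already read by a younger transaction")
    meta.value = value
    meta.max_write_timestamp = max(meta.max_write_timestamp, txn_ts)
```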
Lock-based concurrency control is a pessimistic concurrency control scheme that uses locks on database objects (instead of trying to resolve schedules).
Two-phase locking: This separates the locking phase from the unlocking phase.
- Locking / Growing / Expanding phase: Acquire all the locks that are needed by the transaction. No locks are released.
- Unlocking / Shrinking phase: Release all the locks that were acquired by the transaction during the growing phase. No locks are acquired.
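A sketch of two-phase locking using a coarse per-key mutex; the `LockManager`/`Transaction` classes are invented for illustration, and shared vs. exclusive lock modes (and deadlock handling, discussed next) are left out to keep it short.

```python
import threading

class LockManager:
    def __init__(self):
        self._locks = {}                  # key -> threading.Lock
        self._guard = threading.Lock()

    def lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

class Transaction:
    def __init__(self, lock_manager):
        self.lm = lock_manager
        self.held = []                    # locks acquired during the growing phase

    def acquire(self, key):
        """Growing phase: locks are only acquired, never released."""
        lock = self.lm.lock_for(key)
        lock.acquire()
        self.held.append(lock)

    def commit(self):
        """Shrinking phase: locks are only released, never acquired."""
        for lock in reversed(self.held):
            lock.release()
        self.held.clear()
```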
Deadlocks: A situation when two or more transactions are waiting for each other to release the locks. Example, T1 has locked A and is waiting for B, and T2 has locked B and is waiting for A. So both are forever waiting for each other to release the locks, hence are said to be "in deadlock".
How to handle deadlocks? --- Introduce timeouts and abort long-running transactions.
How to detect deadlocks? --- Using wait-for graphs.
- This is a directed graph where each transaction is a node, and there is an edge from T1 to T2 if T1 is waiting for T2 to release the lock. If there is a cycle in the graph, then there is a deadlock.
- Deadlock detection can be done periodically, or every time the wait-for graph is updated.
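Deadlock detection on a wait-for graph boils down to cycle detection. Below is a minimal depth-first-search version, assuming the graph is a dict mapping each transaction to the set of transactions it is waiting for.

```python
def has_deadlock(wait_for):
    """wait_for: dict like {"T1": {"T2"}, "T2": {"T1"}} -> True (a cycle exists)."""
    visiting, done = set(), set()

    def dfs(txn):
        visiting.add(txn)
        for blocker in wait_for.get(txn, ()):
            if blocker in visiting:                   # back edge => cycle => deadlock
                return True
            if blocker not in done and dfs(blocker):
                return True
        visiting.discard(txn)
        done.add(txn)
        return False

    return any(txn not in done and dfs(txn) for txn in wait_for)
```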
How to avoid deadlocks? --- By restricting lock acquisition. Transaction manager can use timestamps to determine priority order of transactions. A lower timestamp means higher priority.
If T1 attempts to acquire a lock that is held by T2, and T1 started before T2 (that is, T1 has higher priority), then we can use one of the following:
- Wait-die: T1 waits for T2 to release the lock.
- Wound-wait: T1 aborts T2 and acquires the lock.
- Locks are used to isolate and schedule transactions and manage database contents, NOT the internal storage structure itself.
- Locks are acquired on keys.
- Locks are generally stored and managed outside of the tree implementation.
- Locks are more heavyweight than latches.
- Locks are held for the duration of the transaction.
- Latches guard the internal storage structure, that is, the physical representation.
- Latches are acquired on pages.
- Even the lockless concurrency control techniques have to use latches.
- Any page has to be latched for safe concurrent access.
Example use-case: Modifications on leaf-level can propagate to higher levels of B-Tree. In such cases, latches are obtained on multiple levels. This is to ensure that executing queries on the higher levels does not result in inconsistent data (such as incomplete writes or partial node splits).
Readers-Writer Lock (RW Lock) is a lock that allows concurrent (shared) read access, but exclusive write access.
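Python's standard library doesn't ship a readers-writer lock, so here is a minimal condition-variable sketch of the semantics (many concurrent readers or one exclusive writer); it is not fair to writers and is only meant to illustrate the idea.

```python
import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:                       # readers wait while a writer holds the lock
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:  # writer needs exclusive access
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```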
Latch crabbing (also called latch coupling) is an optimisation technique to minimise the time for which a latch is held. It is especially useful compared to a naive latching scheme, where latches are obtained on all the pages from the root to the target leaf node and held for the whole operation.
Latch crabbing allows:
- Holding latches for less time
- Releasing them when the executing operation doesn't need them anymore
When can the latch be released?
- On a read path, the latch on parent node can be released once the child node has been located.
- On an insert path, the latch on parent node can be released if the operation is not going to modify the parent node (via splits).
- On a delete path, the latch on parent node can be released if the operation is not going to modify the parent node (via merges).
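A sketch of latch crabbing on the read path, assuming hypothetical `Node` objects that each carry an `RWLock`-style latch (as sketched above) and a `child_for(key)` method; the point is only that the parent latch is released as soon as the child is latched.

```python
def find_leaf(root, key):
    """Descend from root to leaf, holding at most two latches at a time."""
    node = root
    node.latch.acquire_read()
    while not node.is_leaf:
        child = node.child_for(key)
        child.latch.acquire_read()     # latch the child first ("crab" forward)
        node.latch.release_read()      # then release the parent
        node = child
    return node                        # caller releases the leaf latch when done
```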
Latch upgrading: Instead of acquiring exclusive locks during traversal, we can acquire shared locks and then upgrade them to exclusive locks when needed.
For example: Write operations on leaf nodes require exclusive locks. But, we can acquire shared locks on the parent nodes, and then upgrade them to exclusive locks "if needed" (due to splits or merges).
If multiple threads are trying to upgrade the same latch, then only one of them can succeed. The other threads have to wait or restart.
Pointer chasing: Every request goes through the root node, so we can always latch the root node optimistically. Price of "retry" (pointer chasing) is very rarely paid for root nodes since splits/merges rarely propagate to the root node.
Blink-Trees allow a state called "half-split" where a node is referenced by sibling pointer, but not by the parent's child pointer.
Advantage: We don't need to hold the parent latch when descending to the child level, even if the descent is going to lead to a split. The new node can be made visible through the sibling link, and the parent pointer can be updated lazily without sacrificing correctness. This prevents the parent node from blocking the search path and causing unnecessary restarts, since all elements in the tree remain accessible.
- In the "Log Semantics" section (Page 91), it's mentioned that "Pages are flushed asynchronously and, once this is done, the last_checkpoint record is updated with the LSN of the begin_checkpoint record and, in case of a crash, the recovery process will start from there"
Shouldn't the "last_checkpoint" record be updated with the LSN of the "end_checkpoint" record?
- In "Latch Crabbing" section, it was mentioned that "If the child page is still not loaded in the page cache, we can either latch a future loading page, or release a parent latch and restart the root-to-leaf pass after the page is loaded to reduce contention."
What does it mean?
- In "Latch Upgrading and Pointer Chasing" section, it was mentioned that "The root is always the last to be split, since all of its children have to fill up first. This means that the root node can always be latched optimistically, and the price of a retry (pointer chasing) is seldom paid."
Does "the root node can always be latched optimistically" mean that all threads can be allowed to acquire a "shared latch" on the root node? If not, what does it mean?
- Franck Pachot posted a blog: Isolation Levels in Modern SQL Databases
- Can't wait to read the WAL- and lock-related content in the PostgreSQL 14 Internals book.
This section contains anything worth mentioning that came up as part of the weekly reading group discussion. I will update it when I am adding notes for the next chapter.