- UNIX is a general-purpose, multi-user, interactive OS
- A hierarchical file system with demountable volumes
- compatible file, device, and inter-process I/O
- support async process
- command language per user
- characteristics are simple, elegant, easy-to-use
- Three kinds of files: ordinary disk file, directories, and special files
Ordinary file
- a file contains whatever information the user places on it, no particular pattern is expected
Directory
- A directory is a mapping between names of files to the files themselves, thus induce a structure as a whole
- A directory behaves just like an ordinary file, except that unprivileged programs cannot write
- The system uses a root directory, and a command directory
- Linking: different names for the same file under different directories
- A file does not exist in a particular directory, but only a pointer to the file is included in a directory
- Directory structure is restrained to be a rooted tree, where a directory must only be in one another directory, which is its parent.
Special Files
- Each I/O device is associated with at least one special file
- Write and read ops to special files are exactly the same as an ordinary file, but they result in activation of the associated device
- Three advantages:
- Files and device I/O are as similar as possible
- File and device names have the same syntax and meaning, can be used as parameters
- Special files have the same protection as ordinary files
Removable File Systems
- A removable volume which has the structure of an independent file system containing its own directory hierarchy
- mount can be used to replace a leaf of the hierarchy tree (an ordinary file) by a whole new subtree
- The treatment of a mounted file system is the same as the ordinary file system
- One exception: no link may exist between two file system hierarchies
Protection
- 7 bits:
- 6 independently specify read, write, execute permission for the owner of the file and all other users
- 7th bit specifies if the user id is temporarily changed to the owner of the file during the file's execution
I/O Calls
- No distinction between random and sequential access
- No logical record size imposed by the system
- open(name, flag) -> returns a file descriptor (an int)
- create -> create a new one or truncate an old one to zeros
- No user-visible locks, no restrictions on the # of users having a file opened
- ==View locks as neither necessary nor sufficient to prevent interference between users of the same file==
- Unnecessary: not a large, single-file database where multiple processes operate on
- Insufficient: ordinary locks cannot prevent confusion when multiple users operate on the same file, where each makes a copy of the file being edited
- Read and write read/write a sequential of data from/to file
- seek -> moves pointer to a random spot, then read/write
- A directory entry contains the name of the file and a pointer (an int)
- The pointer is an integer called i-number of the file
- The i-number is an index of the file into a system table (the i-list)
- The entry found is called i-node of the file, containing metadata of the file: owner, protection bits, address for the file contents, size, time of last modification, # of links to the file, dir bit, special file bit, large or small bit
- When a new file is created, an i-node is allocated for it and a directory entry is made from the path to it
- Removing a path is done by decrementing the i-node's link counter, and removing directory entry
- disk is divided into 512-bytes blocks
- Read and write seem to be synchronous and unbuffered, but they are actually buffered, where content are checked first if they are in a buffer, if not, then fetch from disk. Still, writes are async as the write call returns right after value is written in the buffer.
- Image: a computer execution environment, which specifies the current state of a pseudo computer. It includes a core image, general register values, status of open files, current directory, etc.
- Process: a process is the execution of an image. While the processor is executing on behalf of a process, its image is in core
- User-core part of the image is divided into three logical segments:
- text segment: read-only, begins at location 0
- heap: write/read, above the program text segment, non-shared, data segment
- stack: starts at the highest address, automatically grows downward as the hardware's stack pointer fluctuates
Processes
- a new process is created using fork
- parent/child have exactly the same original core image
- use return id to identify themselves
Pipes
- pipe -> returns a file descriptor, creates an interprocess channel called pipe
- pipe is passed in fork call
- read/write are used to pass data between the images of the two processes
Execute
- execute -> requests the system to read in and execute the program
- all the code and data in the process using execute is replaced from the file, but open files, current directories, and interprocess relationships are unaltered
Process synchronization
- wait -> caller suspends execution until one of its children has completed execution
Termination
- exit -> terminates a process, destroys its image, closes open files
- communication with UNIX is done through the Shell
- commands are found and brought into core and executed by the Shell
Standard I/O
- Programs executed by the Shell start off with file 1/0, which are the standard output/input files
Filters
- vertical bar | is used to pipeline outputs to next command's input
- filters can be applied
Command Separators: Multitasking
- & is used to run tasks in the background, so the command returns immediately
Implementation of the Shell
- The Shell waits for new-line char
- when received, arrange arguments and fork
- child runs execute
- parent waits the child to die, and then return its value
Initialization
- init -> creates the very first process, called as the last step in UNIX's initialization
- init creates one process for each typewriter channel which may be dialled up up by a user
- After authenticating a user, the process performs execute and wait for command input
- The mainstream path waits on the Shell process
- Hardware detects program faults, terminates the process, and writes the user's image on a file for debugging purpose
- What problem does this paper address?
- The paper addresses the problem of a general-purpose interactive OS that is simple yet powerful, allowing users to easily involve
- What is the author's main insight?
- Minimalist design, unified abstractions (everything is a file)
- Take the view that everything can be seen as a file: string of data, indexing structure, device I/O
- What are the paper's key strengths?
- Powerful, time-sharing system
- Interactive so addictive
- Easily portable as it is written in high-level language (C)
- Open-source, people help it evolve
- Instead of inventing something new, UNIX does a careful selection of mechanisms used in the system, which produces a simple, elegant, and easy-to-use OS
- What are the paper's key weaknesses?
- Fixed block size not optimal for all apps but minimizes system overhead
- Data independence: a data transparency where user applications should not be affected by the database's in-storage structure and access strategy
- Modern databases provides data independence by offering a high-level user interface through which user applications manage stored data, instead of various bits, pointers, arrays, etc.
- Relational Database properties:
- all information is represented by data values, no "connection"
- The system supports a very high-level language, user can specify query without specifying algorithms for processing
- System R is an experimental relational database that demonstrates it can support both high performance and complete function required for daily production use
- Provide a high-level non-navigational interface for maximum user productivity and data independence
- Support different database usages: programmed transactions, ad hoc queries, report generation
- Support rapidly changing database environment, where tables, indexes, views, transactions, and objects can be added and removed easily
- Support concurrency, protect integrity in a concurrent-update environment
- Database recovery to a consistent state
- Provide a flexible mechanism whereby different views of data can be managed by various users
- Support all functions above with a level of performance comparable to existing lower-function database systems
- Store relations in the form of "tuples", where a tuple store data values
- key advantage: all the data values of a record could be fetched by a single I/O
- inversion: associates domain values with the TIDs of tuples
- Optimizer algorithm:
- cost of fetching tuples
- cost of creating and manipulating TID lists
- cost of fetching the data pointed to by the tuples
- cost measured as the number of I/Os + CPU time
- cluster tuples on physical pages
- Emphasize simple interactions instead of complex queries
- System R consists of an access method (RSS) and an optimizing SQL processor (RDS), modularity design
- All locking and logging functions are in RSS
- All authorization and access path selection functions in RDS
- View subsystem: users to define alternative views of the database
- Authorization subsystem: only access the ones granted
- Recovery subsystem: restored to a consistent state after failure
Query Optimization:
- SQL statements of arbitrary complexity could be decomposed into a relatively small collection of machine-language "fragments"
- In R, SQL statements are examined, optimized, and compiled into small, efficient machine-language routings
RSS Access Paths
- index: associate indexed values to records, similar to "inversion"
- link: a pointer stored with record which connect it to other related records
- Access paths available:
- index scans: access a table associatively and scan it in value order using an index
- relation scans: scan over a table as it is laid out in physical storage
- link scans: traverse from one record to another using links
Optimizer
- Optimizer minimizes the weighted sum of cost of I/O and CPU time
Recovery Subsystem
- Disk failure
- information on disk is lost
- recover by an alternative disk
- System failure
- information in main memory is lost
- the system reverts to the old pages, use the log to undo all updates by incomplete transactions
- Transaction failure
- need to undo changes by the transaction
- revert log and undo
- takes place on-line
Locking Subsystem
- A hierarchy of locks, with several different sizes of lockable units
- Lock sizes range from individual records to several tables
- locking is transparent to end users
- When having a lock on small object, an "intentional" lock is required on its larger object
- When locks on smaller objects accumulate, they can trade for a lock on larger object
- Three levels of isolation
- level 1: may read uncommitted data. No lock acquired
- level 2: cannot read uncommitted data, but successive read can give inconsistent value. Lock when read, release immediately after read.
- level 3: guaranteed consistent successive reads return the same value. Lock held until the end of the transaction.
Compilation Approach
- The approach of compiling SQL statements into machine code is successful
- Can generate machine-language routine to execute any SQL statements
- Beneficial for short, repetitive transactions
- Small cost for database compared to writing transactions on the RSS interface
- What problem does this paper address?
- Can a relational database provide the same level of performance as a lower-level database, while providing easy-to-use, high-level language interface, data independence?
- What is the author's main insight?
- Everything is a table
- Store values only, no connections/pointers
- separate into RDS and RSS, modularity design
- Locking system where locks on individual records
- shadow page + WAL for recovery
- Compile SQL statements into machine code
- Use CPU time + # of I/O for optimization
- Indexing used for access path
- What are the paper's key strengths?
- Similar performance as lower-level database
- Allow evolvability (data independence)
- Top-down approach where easy-to-use for programmers (SQL statements)
- What are the paper's key weaknesses?
- Could be slower for simple imperative queries
- SQL is declarative programming, it is hard to express certain things in declarative form
103 - Architecture of a Database System (119 pages, [blog summary][https://blog.acolyer.org/2015/01/20/architecture-of-a-database-system/])
- Process models
- Parallel architecture
- Storage system design
- transaction system implementation
- Query processor and optimizer architectures
- Typical shared components and utilities
- Three process model options:
- Process per DBMS worker
- Thread per DBMS worker
- Process pool (a variant of process-per-worker, bounded pool of workers)
- Move data from DB to client through:
- disk I/O buffers: some shared memory
- client communication buffers: use client communications socket as a queue for the tuples it produces (for prefetching)
- lock table: shared by all DBMS workers and is used by the Lock Manager
- Log:
- generated during transaction processing
- log tail: log staged to an in-memory queue, periodically flushed to disk in FIFO order
- Admission control:
- admission control policy for multi-user systems
- Does not accept new work unless sufficient DBMS resources are available
- graceful degradation: under overload, latency increases but throughput remain at peak
- Two tiers:
- front door: limit the number of client connections
- within the core DBMS relational query processor: after the query is parsed and optimized, then determines if a query is postponed or with fewer resources
- Shared-memory parallel system:
- all processors can access the same RAM and disk with roughly the same performance
- advantage:
- simple, just like a uniprocessor DBMS
- challenge:
- how to modify the query execution layers to take advantage of the ability to parallelize a single query across multiple CPUs
- scalability
- Shared-nothing parallel system:
- a cluster of independent machines that communicate over a high-speed network interconnect or over commodity networking components
- each table is sliced horizontally, spread across the machines
- common today
- advantage:
- scalability
- challenge:
- cross-processor coordination
- partial failure
- cost
- Shared-disk parallel system:
- all processors can access the disks with the same performance, unable to access each other's RAM
- advantage:
- lower cost of administration compared to shared-nothing
- no partial failure on nodes
- challenge:
- single point of failure
- corrupt page even if RAID
- cross-processor coordination
- relational query processor: takes a declarative SQL statement, validates it, optimizes it into a procedural dataflow execution plan, and executes that dataflow program on behalf of a client program
Main Tasks for an SQL Parser:
- Check the query is correctly specified
- resolve names and references
- convert to internal format which optimizer understands
- verify the user is authorized to execute the query
Query rewrite module
- responsible for simplifying and normalizing the query without changing its semantics
- can only rely on the query and metadata, no access to tables
- e.g. view expansion, semantic optimization
Query optimizer
- transform the internal query representation into an efficient query plan for execution
- A query plan is a dataflow diagram that pipes table data through a graph of query operators
Query executor
- executes a query plan, mostly use iterator model
Versioning
- if versioning is used with timestamp, then queries of the same historical time will provide compatible answers
- no read locks needed
- DBMS has more information on its workload access patterns (compared to OS), DBMS architects should have full control over the spatial positioning of database blocks on disk
- An alternative is to create very large files in the OS file system
- DBMS should have control over when to physically written to the disk
- Throughput in a well-tuned DBSM is typically not I/O bound, but bounded by memory
ACID
- Atomicity: transaction is atomic, either commit or abort, no partial changes
- Consistency: the state is consistent before and after changes
- Isolation: transactions are not interfering each other
- Durability: data stored can survive crash failures
Implementation of ACID
- Isolation: locking protocol
- Durability: logging and recovery
- Isolation and atomicity: combination of locking and logging
- consistency: runtime checks in the query optimizer
Concurrency control enforcement
- Strict two-phase locking: acquire lock before read/write, release at the end of transaction
- Multi-version concurrency control: no locks, guaranteed a consistent view of the database state at some time in the past
- optimistic concurrency control: no locks, just read/write. keep track of values read/write, if some modified before commit, roll back one of the conflicting transactions
Isolation Levels
- Read uncommitted: dirty read allowed, no lock
- Read committed: dirty read not allowed, need a lock to read, but lock released immediately after read
- Repeatable read: need a lock to read, release lock at the end of the transaction
- serializable: as if all transactions are done by one process
Recovery
- Write-Ahead Logging (WAL)
- Each modification to a database page needs to generate a log record, and the log record needs to be flushed before the actual database page is flushed
- Log records need to be flushed in order
- when a commit request occurs, the commit log record needs to be flushed before the commit request returns successfully
Metadata format
- Use the same format for metadata as for data
Replication
- Physical replication: duplicate the entire database periodically
- Bad scaling
- Expensive
- Trigger-based replication: use triggers to store "difference" records in a special replication table
- High performance penalty for some workloads
- Log-based replication (the solution if possible): a log sniffer process intercepts log writes and delivers them to the remote system
- low performance overhead for the running system
- provides incremental updates
- use built-in log logic
- What problem does this paper address?
- A summary of DBMS design and implementation
- What is the author's main insight?
- Key components are process model, SQL parser, storage management, parallel architecture, ACID (Transactions, concurrency, recovery)
- What are the paper's key strengths?
- A detailed, comprehensive summary of academic and industry design and implementation
- What are the paper's key weaknesses?
- A survey, so none?
- Built on UNIX, written in C,
Topics covered
- The system process structure
- The embedding of all INGRES commands in C
- The access methods implemented
- The catalog structure and the role of the database administrator
- Support for views, protection, and integrity constraints
- The decomposition procedure implemented
- Implementation of updates and consistency of secondary indices
- Recovery and concurrency control
- If the function in question can completely and correctly be implemented only with the knowledge and help of end points, then providing the function as a feature of communication system itself is not possible.
- From performance perspective, lower level can put efforts to improve performance, but it doesn't have to be "perfect".
- If implemented at a lower level, it should not incur costs to services that do not need it
- What problem does this paper address?
- When there are various modules, where should we implement different functions?
- What is the author's main insight?
- Implement at end points if the function requires the knowledge of end points
- Implement at a lower level only if performance enhancement for services that require or not require such function
- What are the paper's key strengths?
- Provides theoretical basis for implementing functions at end points
- What are the paper's key weaknesses?
- It is a principle, not a ground truth. It may vary in different systems
- More flexible allocation policies: better locality, adaptive to peripheral and processor characteristics
- Cluster data
- Two block sizes
- Programmer interface enhancement: advisory locks on files, name space extension, long file names, admin control of resource
- Low throughput:
- small block size, limited read-ahead in the system, many seeks
- Disk drive:
- each disk drive contains one or more file systems
- each file system is described by its super-block, that is at the beginning of the file system's disk partition, replicated
- Block size:
- Minimum block size of 4096 bytes
- recorded in super-block
- Cylinder group:
- disk partition into one or more areas called cylinder group, each consists of one or more consecutive cylinders on a disk
- contains super-block, inodes, bit map, usage of data blocks
- bit map: describes available blocks
- bookkeeping info begins at a offset, better reliability
- data is laid out so that larger blocks can be transferred in a single disk transaction, more data transferred per disk transaction
- Divide blocks into segments to efficiently store small files
- A file can occupy several blocks and/or several segments
- allocate new blocks/segments when new data is written and already allocated space is insufficient. Copy old data in fragments over if a new block is allocated.
- wasted space for 4096/1024 system is similar to 1024 system
- a file system has a minimum free space reserve to function optimally
-
File system is parameterized to adapt
- processor capability
- storage characteristics
-
The file system allocates new blocks optimally
- On the same cylinder as the file's previous block
- rotationally well placed
- disk: number of blocks per track, rate at which the disk spins
- processor: expected time to interrupt and schedule
- Two levels:
- Global policy: use file system-wide summary information to make decisions regarding the placement of new inodes and data blocks
- Local policy: local allocation routines that use a locally optimal scheme to lay out data blocks
- Conflicting goals of localizing data for concurrent access vs. spreading out unrelated data
- Two methods to improve performance:
- Improve locality to reduce seek latency
- Improve data transfer size to reduce the number of transfers
- inode:
- inodes of files in the same dir are usually accessed together, place them in the same cylinder group
- a new dir is allocated in a cylinder group with high free inodes and small number of dir in it -> allow inode allocation policy to succeed most of the time
- allocate inodes randomly in a cylinder group: small constant disk transfers cap
- data block:
- place data blocks of a file in the same cylinder group, with rotationally optimal positions
- to prevent filling up, try to evenly place data blocks of large files
- Global policy uses partial local information to make decisions
- When local policy received a block allocation request from global:
- allocate if block is free
- if not, next available block rotationally closest
- if no block available in the cylinder, use a block in cylinder group
- if cylinder group full, hash choose another cylinder group
- if fail, exhaustive search
- inode layout policy effective -> inodes within a dir are clustered
- High utilization of disk bandwidth
- Reads and writes are faster
- blocks of a file are more optimally ordered
- Long file names
- File locking
- granularity of a file
- advisory shared or exclusive locks on files
- What problem does this paper address?
- The original UNIX file system has low throughput, what are the improvements possible?
- What is the author's main insight?
- Increase block size while considering small files
- Parameterize processor and storage for optimization
- Global and local layout policies to optimally allocate inodes and data blocks
- What are the paper's key strengths?
- High throughput
- Layout policies that consider processor and disk characteristics: locality, large data transfer size
- What are the paper's key weaknesses?
- Specialized to disks, hard to port to other medium (e.g. SSD)
Two methods for analyzing file system behaviour and evaluating file system changes:
- Semantic block-level analysis (SBA)
- combines knowledge of on-disk data structures + a trace of disk traffic to infer system behavior
- enables users to understand why the file system behaves as it does
- Semantic trace playback (STP)
- enable modifying traces of disk traffic to represent changes in the file system implementation
- enable users to measure the benefits of new policies without actually implementing it
Journaling File System
- modern file systems are journaling file system
- Write information about pending updates to a Write-Ahead Log (WAL) before committing changes to disk
- Enables fast file system recovery after a crash
Analysis of Linux ext3, ReiserFS, Linux JFS, Windows NTFS
- focus on journaling aspect
- determine the events that cause data and metadata to be written to the journal or their fixed locations
- examine how characteristics of workload and configurations affect it
Overview
- Controlled workload patterns
- Analysis of both time taken for operations + resulting stream of read and write requests below the file system
- it leverages info on block type (e.g. inode or journal)
- interposes on the block interface to storage
Metrics
- Quantity of read and write requests to disk: caching and write buffering
- Block numbers: sequential or random, where
- when read/written: burstness of traffic
Implementation
- Use SBA driver to interpose between the filesystem and disk
- track request and response by storing in a circular buffer
- SBA requires in interpret contents of disk block traffic: interpret block-level traces on-line
Workload
- Use synthetic workload that uncover decisions made by the file system
- Parameters to vary:
- Amount of data written
- Sequential vs. random access
- Interval between calling fsync
- Amount of concurrency
- only write-based workloads
Overhead of SBA
- processing and memory overheads are low
- For each I/O request, overhead is:
- get_time at start/end
- determine block number is journal or fixed-location block
- determine journal metadata or journal data
Overview
- enable developers to rapidly suggest and evaluate file system modifications without actually implementing it
- a user-level process
- takes an input as a trace, parses it, and issues I/O requests to the disk using raw disk interface. Allow threads and concurrency
Requirements
-
Observe two high-level activities:
- observe any file-system level operations that create dirty buffers in memory
- observe application-level calls to fsync: understand why write traffic happens
-
STP input: A file system level trace + SBA-generated block-level trace
Advantages
- not time-consuming as actual implementation
- no need for a detailed and accurate simulator
Limitations
- If workload has complex virtual memory behavior whose interactions are not modelled, results may not be meaningful
- Can only evaluate changes that are not too radical
- Does not provide how to implement
Overview
- Disk:
- split into block groups
- each block group: bitmaps, inode blocks, data blocks
- ext3 journal stored as a file
- Journal overview:
- info about pending updates written to the journal
- force journal updates to disk before updating file system structure
- recover by scanning through the journal and redo incomplete committed operations
- Journal treated as a circular buffer; space reclaimed once written to its fixed location in the ext2 structures
Journaling Modes
- Three modes: writeback mode, ordered mode, data journaling mode
- Writeback mode:
- only file system metadata is journaled
- data block directly written to disk
- does not enforce any ordering between journal and disk data writes
- guarantee consistent file system metadata
- no guarantee to data block consistency
- weakest semantics of the three modes
- Ordered journaling mode:
- only metadata is journaled
- data writes to disk before the journal writes of the metadata
- both data and metadata are guaranteed to be consistent after recovery
- Data Journaling mode:
- journal both data and metadata (written to disk twice)
- data and metadata written to disk after journal commit
- both data and metadata are guaranteed to be consistent after recovery
Transactions
- ext3 groups many updates into a single compound transaction that is periodically committed to disk
Journal Structure
- Use additional metadata to track the list of journaled blocks
- journal superblock: tracks summary info of the journal (e.g. block size, head and tail pointers)
- journal descriptor block: marks the beginning of a transaction and its subsequent journaling blocks, including their final fixed on-disk location
- the descriptor block is followed by medata only / data + metadata in writeback and ordered journaling / data journaling mode
- journal logs full blocks (instead of diff): if a single bit in the bitmap changes, its entire block is logged
- journal commit block: end of the transaction; once written, the journaled data can be recovered without loss
Checkpointing
- in ext3, a basic form of redo logging is used
- scans the log for committed complete transactions, discard incomplete transactions
Analysis of ext3 with SBA
- Modes and Workload
- random writes have better performance with logging vs. without
- sequential higher bandwidth than random
- fewer fsync call -> higher bandwidth
- small cost of journaling
- writeback & ordered -> similar performance to no logging
- performance of data journaling is irregular -> sawtooth pattern
- Concurrent synchronous unrelated transactions can result in very low bandwidth
- Journal Commit Policy
- impacted by settings of the commit timers
- flush metadata frequently, flush data slowly
- enforce disk write order by modes
- impacted by settings of the commit timers
Improve ext3 with STP
- place journal in the middle -> less seek time
- adaptive journaling mode: sequential writes -> ordered journaling; random writes -> data journaling
- separately flush unrelated transactions to disk
- differential journaling
- What problem does this paper address?
- How to accurately measure a file system's usage of a disk?
- How to gauge the performance differences if there are modifications to the system?
- What is the author's main insight?
- Use a block-level analysis tool that interposes the file system and disk
- Understand what is truly happening from the disk's point of view
- Use synthetic workload to gauge what will happen if certain system changes are made
- What are the paper's key strengths?
- Help understand why the file system is performing as it is
- Gauge performance changes of system modifications
- What are the paper's key weaknesses?
- STP not applicable to major system changes
- SBA/STP cannot give insight on how to change it
- All modifications are stored to disk sequentially in a log-like structure
- speeds up both small file writing and crash recovery
- almost no seek time for writes
- log contains indexing info for efficient reads
- Faster small-file writes, others similar
- Segment:
- divide log into segments
- use a segment cleaner to compress the live information from heavily fragmented segments
- Cleaning policy:
- segregates older, more slowly changing data from young rapidly-changing data; tread them differently during cleaning
- Challenges:
- How to retrieve information from log: indexing
- Ensure there are always large extents of free space available for writing new data: segment cleaning
- Too many small accesses: cuz spread information around the disk
- Synchronous writes: for small files traffics, synchronous metadata writes dominate disk traffic, unable to benefit from fast CPU
Fundamental Idea
- To reduce # of disk I/O: buffer small writes in main memory, then write all of them to disk sequentially in a single disk I/O
- Information to write includes: file data block, attributes (inode), index blocks, directories, all other info to manage file system
Inode
- Inode is written to log (instead of at a fixed-location in UNIX FFS)
- inode map: maintain current location of each inode
- given a file descriptor, inode map must be indexed to determine the disk address of the inode
- divided into blocks, written to log
- a fixed checkpoint region stores location of all inode map blocks
- inode map usually cached in main memory and does not require disk access
- A disk is divided into segments for management
Goal
- Maintain large free space for writing new data
- Manage small segments at random locations
Threading
- leave the live data in place, thread the log through the free extents
- cause free space to become severely fragmented
- afterwards, large continuous writes are not possible, no faster than traditional file systems
Copying
- Copy live data out of the log
- live data written back in a compacted form at the head of the log
- costly, especially for long-lived files. Need to copy long-lived files in every pass of the log.
Sprite LFS
- A combination of threading and copying
- The disk is divided into segments
- Always write to segment from its beginning to its end
- All live data needs to be copied out of a segment before rewriting to the segment
- Can thus compact long-lived files to certain segments and skip them during copying
- Segment is large enough so can use full bandwidth
Segment Cleaning
- segment cleaning: the process of copying live data out of a segment
- three step: 1) read a number of segments into memory; 2) identify live data; 3) copy live data compactly into a smaller number of cleaner segments
- Use segment summary block to identify which file this block belongs to and its block number -> inode points to new block
- Clean cold segments more frequently than hot segments
Checkpoint
- a position in the log at which all of the file system structures are consistent and complete
- Two-phase process to create a checkpoint
- Writes out all modified info to the log (i.e. file data blocks, indirect blocks, inodes, inode map, segment usage table)
- writes a checkpoint region to a fixed position on disk (i.e. address of inode map blocks, segment usage table, current time, last segment written)
- checkpoint periodically
Roll-forward
- After reading checkpoint, scan through the log segments
- new inode -> update inode map
- adjust utilization in the segment usage table
- Directory operation log: used to persist directory changes
- Temporal locality: data written at similar time are closer
- LFS handles random small writes more efficiently
- slower sequential reads
- similar random reads
- Cleaning cost is reasonable
- What problem does this paper address?
- Disk performs poorly with random small write workloads, how to optimize them with reasonable overheads?
- What is the author's main insight?
- Store everything in a log
- random writes become sequential writes
- buffer random/sequential small writes to a large sequential write
- Use segment cleaning to ensure a large contiguous space to write log
- Use segments, both threading and copying to achieve efficient segment cleaning
- What are the paper's key strengths?
- Very efficient random small writes
- Reasonable overheads
- What are the paper's key weaknesses?
- Overheads may not be worthwhile for workloads of large sequential writes
- A two-level storage hierarchy
- upper level: two copies of active data, provide full redundancy, excellent performance
- lower level: RAID 5 parity protection, inactive data, excellent storage cost, somewhat lower performance
- Automatically and transparently manages migration of data blocks between the two levels as access patterns change
- Easy-to-use, suitable for a wide variety of workloads, largely insensitive to dynamic workload changes
- Different RAID levels have different performance characteristics and perform well only for a narrow range of workloads
Assumptions
- Data needs to have active and inactive parts (o/w cost performance will reduce to that of mirrored data)
- Active subset must change relatively slowly over time (o/w resources wasted on switching between levels instead of doing work)
- Studies show these conditions are usually met in practice
Implementation Approaches
- Manually, the system admin:
- pro: human intelligence, extra knowledge.
- con: error prone, cannot adapt changing access patterns, require highly skilled people
- File system, perhaps a per-file basis:
- pro: balance between knowledge and implementation freedom
- con: too many file systems, hard for deployment
- Smart array controller, behind a block-level device interface:
- pro: very easily deployable
- con: knowledge about files has been lost
Mapping
- Host block addresses are internally mapped to their physical locations. Allows transparent migration of individual blocks
Mirroring
- Active data are mirrored for best performance and provide single-disk failure redundancy
RAID 5
- Write-inactive data are stored in RAID 5 for best cost capacity, somewhat lower read performance, single-disk redundancy
- Large sequential writes go directly to RAID 5 -> benefit from high bandwidth for this access pattern
Adaptation to Changes in the Amount of Data Stored
- Initially the array is empty
- As data are added, internal space is allocated to upper level until no more data can be stored this way
- Some of the storage space is automatically reallocated to the RAID 5 storage class, data are migrated down into it from the mirrored storage class
Adaptation to Workload Changes
- When active set of data changes, newly active data are promoted to mirrored storage
- less active data are demoted to RAID 5
- Keep the amount of mirrored data roughly constant
On-Line Operations
- online storage capacity expansion
- easy disk upgrades
- controller fail-over
- active hot spare
Log-Structured
- Writing to RAID 5 in a log-structured fashion to avoid reading old-date or old-parity for better performance
Physical EXtents (PEXes)
- disk is broken up into PEXes
- 1MB in size
- Several PEXes combined to make a Physical Extent Group (PEG)
Physical Extent Groups (PEGs)
- 1 PEG consists of several PEXes
- PEXes are allocated to PEGs to 1) balance the amount of data on the disks; 2) retain the redundancy guarantees
Segment
- Unit of contiguous space on a disk
- PEX divided into a set of 128KB segments
Relocation Blocks (RB) - Logical Data Layout
- logical space provided by the array is relocation block
- basic unit of migration
- Size of RB
- smaller -> more mapping info to record, increase seek and rotation time
- larger -> increase migration costs if only small amount of data updated in each RB
Read Requests
- Similar to the original RAID
Write Requests
- Data is first written to non-volatile memory
- When flushed to disk, promotion of RB if it is not in RAID 1
- If not enough space, demote least-recent-written RB from RAID 1 to RAID 5
- If thrashing happens (working set too large), do not promote working RB
Background Migration
- Goal: maintain enough empty RB slots in RAID 1 storage
- Demote least-recent-written RBs
Log-Structured RAID 5
- No need to read old data to update data
- Eliminate RAID 5's small write problems
- Write goes to 2 disks for upper level write -> only need to wait for NVRAM
- Compaction: clean RAID5 and plug holes in the mirrored disks
- What problem does this paper address?
- How to design an easily deployable automatically tuned RAID storage system that achieve both performance and capacity?
- What is the author's main insight?
- Two levels: RAID 1 and RAID 5
- Implement at smart array controller level to achieve the highest deployability
- Automatic compaction, migration, and balancing between two levels
- What are the paper's key strengths?
- Able to achieve both performance and capacity
- Automatically adjusted to workloads
- What are the paper's key weaknesses?
- Lacking information on files/applications, may not be the optimal solution
- Recoverable Virtual Memory: regions of a virtual address space that provides transactional guarantees.
- RVM: an efficient, portable, and easy-to-use implementation of recoverable virtual memory
- Enable independent control over the transactional properties of atomicity, durability, and consistency
- RVM is a balance between system-level concerns of performance and functionality, and software engineering concerns of usability and maintenance
- Minimalist design
Functionality
- Layered approach
- RVM only handles atomicity, process failure
- Removes nesting, distribution, concurrency, resiliency to media failure
OS Dependence
- Only depend on small, widely supported OS interface
- No tight coupling with the OS
- RVM is the building block for meta-data in systems
- External data segment:
- RVM's backing store for a recoverable region, is completely independent of the region's VM swap space
- A system file or a raw disk partition
Structure
- Make RVM as a library that is linked in with an application
- Trust application and RVM will not damage each other
- Applications cannot share a single WAL on a dedicated disk
- Each process using RVM has a separate log
- Can be in a Unix file or on a raw disk partition
- If a file, RVM uses fsync to flush
Segments and Regions
- Recoverable memory is managed in segments
- Backing store for a segment is the external data segment (file or raw disk partition)
- region: a segment is divided into regions
- corresponds to a related collection of objects
- can be as large as the entire segment
- Applications explicitly map regions of segments into their virtual memory
- RVM ensures the newly mapped data represents the committed image of the region
- Copy when a region is mapped
- Copying of data from external data segment to virtual memory happens when a region is mapped (instead of on page demand)
- Copy the process' recoverable memory as a whole
- startup is slower
- Segment mapping:
- no region of a segment may be mapped more than once by the same process
- mappings cannot overlap in virtual memory
- Regions can be unmapped at any time
RVM Primitives
- Initialization: specify by a process which log to use by RVM
- Map: once per region to be mapped. Specify which external data segment and the range of virtual memory addresses for the mapping.
- Unmap: unmap a region. The region can be mapped to the process' other virtual address space
- begin_transaction: let RVM a transaction begins, with a transaction id
- set_range: let RVM know that a certain area of a region is about to be modified. RVM then record the current value of the area, in case of abort.
- Read operations on mapped regions require no RVM intervention
- end_transaction/abort: commit/abort a transaction
- a successful commit guarantees permanence of changes made in a transaction
- no-flush: commits don't flush instantly, trade performance with weaker permenence guarantee
- flush: block until all committed no-flush transactions have been forced to disk
- no-restore: application won't explicitly abort transaction, no undo
- truncate: blocks until all committed changes in the WAL have been reflected to external data segments
Log Management
- Use no-undo/redo logging: no uncommitted changes reflected to external data segment; only the new-value records of committed transactions have to be written to the log
- Upon commit, old-value records are replaced by new-value records, reflect the current contents of the corresponding ranges of memory, is forced to the log and write out to external data segment.
- No-restore and no-flush are efficient:
- no-restore needs no copying old-value to prepare for explicit abort, time and space efficient
- no-flush needs not to flush to external data segment, lower latency
Crash Recovery
- RVM scans the log from tail to head
- construct an in-memory tree of the latest committed changes for each external data segment
- trees are traversed, applying modifications
- the head and tail location information in the log status block is updated to reflect a new empty log
Log Truncation
- reclaim space allocated to log entries by applying the changes contained in log to the recoverable data segment
- Intra-transaction optimization: redundant set-range calls are ignored
- Inter-transaction optimization: check for modifications to the same address, only log the last one
- What problem does this paper address?
- How to enable applications to manage recoverable virtual memory, with clear failure semantics
- What is the author's main insight?
- Do one thing well
- Focus on providing a persistent metadata management only
- Ignore other features that can be layered on top
- Use no-undo/redo log
- use external data segment for persistent storage
- What are the paper's key strengths?
- Did one thing well
- What are the paper's key weaknesses?
- No demonstration of how ignored features can be layered on top
- Haystack is an object storage system
- Reduce per photo metadata
- Haystack storage machines can perform metadata lookup in memory, leaving disk bandwidth to actual data transfers
- special workload: written once, often read, never modified, rarely deleted
- Limitation of NFS-based storage:
- long tail:
- cache doesn't work for old/unpopular photos
- Linear directory works badly for many photos/directory
- many disk I/Os for an image
- directory's block map too big to cache in memory
- long tail:
High throughput and low latency
- Keep all metadata in main memory
- Reduce metadata per photo
- Achieves high throughput and low latency
- at most 1 disk operation per photo
Fault-Tolerance
- Replicate each photo in geographically distinct locations
Cost-effective
- Cost 28% less per usable terabyte
- can process 4x more reads per seconds
Simple
- simple, so easier build and deploy
- Store multiple photos in a single file
Haystack Store
- persistent storage system for photos
- only component that manages filesystem metadata (used to retrieve photo)
- Store's capacity is divided into physical volumes
- Physical volumes on different machines are grouped to be logical volumes
- Each write to logical volumes is written to all its physical volumes for redundancy
Haystack Directory
- Logical to physical volumes mapping
- logical volume where each photo resides
- logical volumes with free space
Haystack Cache
- internal CDN
- a cache for popular photos
- reduce dependence on external CDNs
- distributed hash table
- Only cache
- not from CDN, but directly from client
- from writable machine (photos are accessed the most right after they are uploaded)
Read Workflow
- Browser requests URL from Web Server, which generates a path using information from Directory
- Browser sends request to Cache, which sends to Store if not cached
- Store looks up the relevant metadata in its in-memory mapping, seeks the appropriate offset in the volume file, reads the entire needle, do some sanity checking, and return
![image-20210625200546365](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210625200546365.png)
Write Workflow
- Web server provides the logical volume id, key, alternate key, cookie, and data to Store
- Each Store machine synchronously appends needle images to its physical volume files, updates in-memory mappings
- Updates to the same needle are appended to the same physical volume, determine the newest update with the highest offset
![image-20210625200703034](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210625200703034.png)
Delete workflow
- Set delete flag in in-memory mapping and volume file
Index File
- Index file: a checkpoint of the in-memory data structures, used to locate needles efficiently on disk
- Used to reboot in-memory mapping quickly
- Since index file is updated asynchronously, there are orphans which are not in the index file during restart -> create a matching index record and append that record to the index file will do the trick
Filesystem
- Use XFS reasons:
- blockmaps for several contiguous large files can be small enough to be stored in main memory
- efficient file preallocation
Failure Recovery
- Detection: background task periodically test connection, check availability, attempts to read data from each Store machine. If fails, mark read-only
- Repair: bulk sync from another replica machine
Compaction
- Compact fragmentation due to deletion
Batching
- Batch uploads whenever possible
- What problem does this paper address?
- How to efficiently manage photos, need high throughput and low latency
- What is the author's main insight?
- Multiple photos in one file (photo offset becomes a metadata) to reduce metadata overhead
- Reduce photo metadata size and keep all photo mapping in-memory, at most one disk operation per photo
- What are the paper's key strengths?
- Considers the specific application of Facebook photo storage, CDN environment, and usage patterns
- What are the paper's key weaknesses?
- If the number of photos increases faster than main memory size, then there will be a point where memory is not capable of storing all photo metadata. What to do then?
-
No seek or rotational time
-
Latency = queuing time + controller time + transfer time
-
Highest throughput -> sequential OR random read
-
Data read and written in page-sized chunk (4KB)
-
Before writing, entire block needs to be erased (256KB)
- Cannot overwrite exiting block
-
Flash Translation Layer (FTL): translates logical block address to physical block address in flash storage
Limitation
- Frequent random writes to an SSD would incur internal fragmentation, increase SSD's I/O latency, reduce device lifetime
Flash-friendly on-disk layout
- Three configurable units: segment, section, and zone
- allocates storage blocks in unit of segment from a number of individual zones
- clean in unit of section
- These units align with the underlying FTL's operational units to avoid unnecessary and costly data copying
Cost-effective index structure
- Wandering tree problem: recursive updates of index blocks (data -> direct -> indirect)
- Node address table: index table to attack this problem
Multi-head logging
- Hot/cold data separation scheme during logging time (i.e. block allocation time)
- Multiple active log segments concurrently
- append data and metadata to separate log segments based on their anticipated update frequency
- flash storage devices exploit media parallelism -> multiple active segments can run simultaneously without frequent management operations
- performance degradation is insignificant
Adaptive logging
- At high storage utilization, F2FS use threaded logging
- writes new data to free space in dirty segment without cleaning it in the foreground
fsync acceleration with roll-forward recovery
- F2FS optimizes small synchronous writes to reduce the latency of fsync requests
- minimizes required metadata writes and recovers synchronized data with an efficient roll-forward mechanism
Entire volume into 6 areas
- Superblock (SB): basic partition information, default params of F2FS; non-changeable
- Checkpoint (CP): keeps file system status, bitmaps for valid NAT/SIT sets, orphan inode lists, summary entries of currently active segments. Stores two recovery points: last stable version, intermediate version.
- Segment Information Table (SIT): per-segment information, number of valid blocks in the segment, bitmap for validity of all blocks in the Main area. For cleaning, used to select victim segments and identify valid blocks.
- Node Address Table (NAT): block address table, locate all the node blocks in the Main area
- Segment Summary Area (SSA): stores parent information of all blocks in the Main area. During cleaning, SSA is used to identify parent node blocks before migrating valid blocks.
- Main Area: filled with 4KB blocks. Either node block or data block:
- node block: inode or indices of data block
- data block: directory or user file data
Look-up operation for "/dir/file"
- use NAT to find and read root inode, then use root inode block to find its data block
- use root inode's data block to find dir's inode number
- use NAT to translate dir's inode number to a physical location
- find dir's inode, use dir's inode to find dir's data block, which finds file's inode number
- repeat translation and location for file
- Actual file data is somewhere in Main area
- NAT: stores the physical locations of all node blocks, through node ID
- node block:
- inode: file's metadata
- direct: block address of data
- indirect: node IDs of other node blocks
- Wandering tree problem:
- problem: in traditional LFS, when leaf data is updated, its direct and indirect pointer blocks need to be updated
- solution: only update one direct node block and its NAT entry
- inline data and inline extended attributes in the inode block: reduces space requirements and improve I/O performance
- 4KB directory entry (dentry) block: a bitmap and two arrays of slots and names in pairs
- bitmap tells whether each slot is valid or not
- a slot has a hash value, inode number, length of a file name, file type
- directory file: constructs multi-level hash tables to manage a large number of dentries efficiently
- When a file name is given in a directory, F2FS calculates its hash value, and traverses the multi-level hash tables in O(logn) complexity
- Six major log areas for hot/warm/cold data
- Optimize cleaning:
- Optimally, the number of valid blocks is either 0 or all
- Foreground and background cleaning
Victim selection
- Foreground: greedy (i.e. smallest # of valid blocks first) to minimize latency to application
- background: cost-benefit (i.e. utilization + age) to separate hot and cold data
Valid block identification and migration
Foreground
- validity bitmap per segment in SIT
- For each valid block, F2FS uses SSA to retrieve parent node blocks containing its index.
- if valid, migrate
Background
- loads the block into page cache and marks them dirty
- lazy migration to alleviate performance impact on foreground I/O activities
- allow combining small writes
Post-cleaning process
- victim section becomes pre-free
- after a checkpoint, the section becomes free
- normal logging: blocks are written to clean segments, strictly sequential writes. When storage becomes almost full, suffer cleaning overheads
- threaded logging: writes blocks to holes in existing dirty segments. No cleaning required, trigger random writes -> degrade performance
- F2FS switch between them based on the number of clean segments available
- Checkpointing: provide a consistent recovery point from system crashes
Checkpointing Procedure:
- All dirty node and dentry blocks in the page cache are flushed
- suspends ordinary writing activities such as create, mkdir...
- system metadata, NAT, SIT, SSA are written to disk
- F2FS writes a checkpoint pack
Checkpoint pack
- 2 checkpoint packs in total
- Header and footer: start & end, versioning
- NAT and SIT bitmaps: validity of current NAT and SIT blocks
- NAT and SIT journals: contain a small number of recently modified entries of NAT and SIT to avoid frequent NAT and SIT updates
- Summary blocks of active segments: in-memory SSA blocks that will be flushed to the SSA area
- Orphan blocks: keep orphan inode info
- roll back to the latest consistent checkpoint
Strategy to fast fsync
- write data blocks and their direct node blocks only, ignoring all other blocks
- collects the direct node blocks with special flag N+n, construct a list of their node info
- Use the node info in the list, loads the most recently written node blocks, named N-n, into the page cache
- compares the data indices in between N-n and N+n
- if different data indices, refreshes the cached node blocks with new indices from N+n, mark dirty
- checkpoint
- What problem does this paper address?
- How to design a FS with a different storage system (i.e. SSD)? How to utilize its benefits and alleviate its drawbacks.
- What is the author's main insight?
- Hot/cold data logging
- align segment/section/zone with SSD layout
- optimize fsync performance
- adaptive logging based on storage utilization
- Handle wandering tree problem with NAT
- Cleaning: background and foreground
- What are the paper's key strengths?
- Adjust to SSD's characteristics
- What are the paper's key weaknesses?
- Designed specifically for SSD?
- Use of virtual block device
- Use RAM disk
- manually delay I/O requests
- variable latency is not emulated
- Evaluate file systems across different device speed
- Almost all file systems improve performance when underlying storage latency decreases
- Performance flattens out under 1ms
- EXT4 good across many environments
- F2FS better at latencies below 1ms
- Transaction: a collection of one or more operations on one or more databases, which reflects a single real-world transaction
- ACID: Atomicity, Consistency, Isolation, Durability
- Consistency:
- Transaction starts with the assumption that integrity constraints hold
- After transaction, all integrity constraints still hold
- Threats to data integrity
- Need for application rollback
- system crashes
- Concurrency
- Consistency levels: strict serializability, linearizability, sequential consistency, causal consistency, eventual consistency
- Lock modes:
- exclusive/write (X)
- shared/read (S)
- X conflicts with X and S; S and S do not conflict
- Strict two-phase locking
- locks are incrementally acquired in a specific order
- all locks are released atomically when txn ends, as part of the commit or abort
- If each transaction does strict two-phase locking, then executions are serializable
Data item granularity
- one tuple in a table
- a page, several tuples
- a table
Granularity tradeoff
- larger: fewer locks -> less overhead; less concurrency possible
- smaller/finer: more locks -> more overhead; more concurrency possible
- System usually gets fine grain locks until there are too many of them, then they are replaced with larger granulariy locks
Multigranular locking
- A hierarchy of locks, needs to manage conflicts among items of different granularity
- System gets "intention" mode locks on larger granules before getting actual S/X locks on smaller granules
- A hierarchy of locks, each child is at a finer granularity of its parent
- Intention mode: used to tag (lock) ancestor of a locked node
Modes
- Exclusive (X): gives exclusive access to the requested node and all its descendants
- Share and Intention Exclusive (SIX): gives shared access to the requested node and all its descendants. gives intentional exclusive access to the requested node. Requester is able to acquire X, SIX, or IX lock on descendants.
- Share (S): gives shared access to the requested node and all its descendants.
- Intention Exclusive (IX): gives intention exclusive access to the requested node. The requestor can acquire X, SIX, S, IX, IS lock on descendants.
- Intention Shared (IS): gives intention shared access to the requested node. The requester can acquire S, IS lock on descendants.
Compatibility
X | SIX | S | IX | IS | |
---|---|---|---|---|---|
X | - | ||||
SIX | - | - | |||
S | - | - | Y | ||
IX | - | - | - | Y | |
IS | - | Y | Y | Y | Y |
Policies
- Before requesting S or IS on a node, all ancestors of it needs to be in IX or IS lock
- Before requesting X, SIX, IX on a node, all ancestors needs to be in SIX or IX lock
- Locks are either: a) released atomically; b) from lower to higher level
- What problem does this paper address?
- How to manage locks under the tradeoff of concurrency and overhead?
- What is the author's main insight?
- Hierarchical locking
- Intentional modes
- extensible to DAG
- What are the paper's key strengths?
- Good balance on the tradeoff for small database
- What are the paper's key weaknesses?
- For large database, still too much overhead
- Read uncommitted (lv 1): can read uncommitted writes
- No S-locks at all
- Read committed (lv 2): only read committed writes, but distinct reads within the same transaction on the same data item could be different
- S-locks acquired during read, released immediately
- Repeatable read (lv 2.99): only read committed writes, guarantees serializability on individual data items, but phantom reads could happen (e.g. read different # of entries in a table)
- S-locks acquired til transaction ends
- Serializable (lv 3): completely isolated
Snapshot Isolation
- Every transaction sees a "snapshot" of the database, at an earlier time
- Read may not give current value
- Use old versions to find value
First Committer wins (FCW)
- A transaction T is not allowed to commit if any other transaction has modified the same item after T starts, and committed earlier than T
- T must hold write locks on modified items at time of commit
SI benefits
- reads never blocked
- reads don't block writes
- avoid common anomalies:
- no dirty data
- no lost update
- no inconsistent read
- no phantoms
- What problem does this paper address?
- A clear definition of isolation levels
- What is the author's main insight?
- 4 levels
- What are the paper's key strengths?
- What are the paper's key weaknesses?
- in isolation levels, there is a tradeoff between consistency vs. performance
- PL-1 (Read uncommitted): T's writes are completely isolated from the writes of other transactions
- PL-2 (Read committed): T has only read updates of transactions that have committed by the time T commits
- PL-2.99 (Repeatable read): T is completely isolated from other transactions with respect to data items; has PL-2 guarantees for range reads
- PL-3 (Serializable): T is completely isolated from other transactions
- What problem does this paper address?
- Make isolation levels generalized and applicable for other isolation protocols
- What is the author's main insight?
- Identify behaviors of each level, apply to generalized protocols
- What are the paper's key strengths?
- not only pessimistic, but also optimistic and multi-version concurrency control
- What are the paper's key weaknesses?
-
IPC is used to allow communication and synchronization between processes
-
Two methods: shared memory and message passing
- Allows two or more processes to have access to the same memory region, can exchange data without copying it
- usually implemented via memory mapped files; shared memory is mapped to address space of each process that requires access
- Preferred when processes need to exchange large amounts of data
- Two processes establish a communication link, exchange messages using send/receive
- Fixed or variable size: fixed easy for OS designer, variable easy for programmer
- a standard message has a header and a body
Direct vs. Indirect
- Direct communication link: receiver identity is known and message is sent directly to the receiving process. There is at most one link between two processes.
- Drawback: limited modularity: changing the identity of a process requires notifying all sender and receiver having a link
- Indirect communication link: via a shared mailbox (or port), which consists of a queue of messages. The sender keeps the message in mailbox and the receiver picks them up.
- advantage: the mailbox can be later bound to another process; allow multiple senders to one mailbox
- drawback: sender doesn't know which process will actually receive its message
Blocking vs. non-blocking
- decide if send/receive ops are blocking (synchronized) or not
Buffering
- Size of receiver's queue
- no queue: sender has to wait until receiver is ready
- bounded queue: sender needs to wait if queue is full
- unbounded: sender never needs to wait. Designer needs to be careful with physical resource limits
Shared Memory | Message Passing |
---|---|
Shared memory region | Message passing facility |
Used for communication between processes on a multiprocessor system where communicating processes reside on the same machine (needs to share a common address space) | Used in a distributed environment where communicating processes on remote machines, connected via network |
Application programmer explicitly writes code for write/read ops to/from shared memory | Message passing facility programmer does low-level coding |
Maximum speed of computation, no syscall during communication | Time consuming, needs kernel intervention (syscalls) |
Processes need to avoid racing conditions | Useful for sharing small amounts of data, no racing conditions |
Faster | Slower |
- monitor unifies:
- the synchronization
- the shared data
- the body of code which performs the accesses
- data is protected by monitor, can only be accessed within the body of a monitor procedure
- monitor ensures that at most one process is executing a monitor procedure at a time
- Scaling shared mutable data is a difficult problem
- Two approaches:
- Guarantee consistency
- serializing all updates, hard to scale
- Guarantee scalability
- gives up consistency guarantees
- e.g. use last-writer-wins (LWW)
- Optimistic replication:
- replicas diverge
- resolve with LWW-like methods or serialization
- Guarantee consistency
Commutative Replicated Data Type (CRDT)
-
ensures Strong Eventual Consistency (SEC): eventual consistency (i.e. may have conflict) with the guarantee that all replicas that have received the same updates (maybe different orders) will eventually reach the correct state
-
replicas converge if:
- concurrent updates are commutative
- all replicas execute all updates in causal order
-
No need for consensus-based concurrency control because there is no conflict
-
Ensures consistency at large scale at a low cost
-
e.g. a set with a single add-element operation
- Abstraction of an ordered sequence of atoms
Model
- A collection of sites (nodes), each carrying a replica of a shared ordered-set object, connected by a reliable broadcast protocol
- Support a peer-to-peer, multi-master execution model
- Some replica initiates an update and executes it against its local replica
- each other site eventually receives the operation and replays it locally
- causally-related operations execute in order, but concurrent operations may execute in different orders in different replicas
- operations
- insert(ID, newatom): adds newatom with ID to the ordered-set
- delete(ID): deletes the atom with ID from the ordered-set
- assumptions:
- two inserts or deletes on different IDs commute
- operations are idempotent
- To ensure commutativity across different sites, we ensure that no two IDs are equal across sites
- ID properties:
- Two replicas of the same atom have the same ID
- no two atoms have the same ID
- ID remains constant
- There is a total order for IDs
- ID space is dense (i.e. can allocate atom between any two IDs)
Treedoc Structure
- A tree
- An atom identifier is TID, a path in a tree
- if the tree is balanced, average TID size is O(log n)
- mini-node: internal nodes within a node, to solve concurrent TID allocations
- disambiguator: identifies which site inserted each mini-node; unique, ordered, gives total order between entries
- major node: contains mini-nodes
Treedoc insert and delete
-
delete: discards the atom with TID, mark the tree node as a tombstone
-
insert: chooses a fresh TID, insert it in a tree as you normally do (append 0 or 1 for left or right child)
Treedoc Restructuring
-
After a period of time, the tree is either badly balanced or filled with tombstone
-
flatten: transforms the tree into a flat array, eliminate all storage overhead
-
flattening operation changes TIDs, so allow non-ambiguous renaming
-
flattening is not commutative with update ops:
- update-win approach: when update happens concurrently with a flatten, update wins and flatten aborts
- use two-phase commit for coordination
-
problematic with large-scale and dynamic systems
-
flatten results:
- decrease file size
- decrease latency
Nebula
- core: small group of well-known and well-connected sites, only group that participates in flattening; membership protocol needed to join/leave
- nebula: not in the core, does not participate in flattening
- epoch: interval between flattening
- sites may exchange updates only if in the same epoch
- core sites have the same epoch
- nebula may be behind
- Nebula catch-up protocol
- If core is in epoch n, nebula in epoch n-1
- core sends all n-1 epoch updates to nebula, which then catch up to n by flattening its local tree
- nebula transforms its old message to epoch n format, then send to core
Approximate causal ordering
- vector clocks are used to ensure causal ordering and to suppress duplicate messages
- causal ordering is already in Treedoc:
- insert a node before inserting its descendant
- insert a node before deleting it
- unrelated nodes can be updated concurrently
- What problem does this paper address?
- How to achieve both consistency and scalability with a special yet generalizable data type?
- What is the author's main insight?
- Commutative Replicated Data Type
- Treedoc
- What are the paper's key strengths?
- Achieve both consistency and scalability with CRDT
- Demonstrates that practical CRDT exists
- What are the paper's key weaknesses?
- Coordination is still required at flattening
- Minimizing coordination is the key to high-performance, scalable, high-availability database design
- Coordination: multiple concurrent processes need to synchronize state/operations, o/w stall, in order to achieve correctness
Invariant Confluence
- if an application's operations are I-confluent, a database can correctly execute them without coordination
- if operations are not I-confluent, then coordination is required to ensure correctness
- I-confluence exposes a trade-off between the operations a user wishes to perform and the properties she wishes to guarantee
- Try to define operations that ensure I-confluence
Cost of Coordination
- High latency
- Low throughput
- partial failure -> unavailability
- What problem does this paper address?
- When is coordination strictly necessary to maintain application-level consistency?
- What is the author's main insight?
- Identify invariants in the database, and conduct I-confluence analysis on invariants to see if certain coordination is necessary.
- I-confluence presents a trade-off between operations a user wishes to perform and the properties she wish to guarantee
- What are the paper's key strengths?
- Identify scenarios where coordination is unnecessary given certain database operations
- What are the paper's key weaknesses?
- It is non-trivial to identify invariants, which need to be comprehensive
Transaction Failure
- e.g.
- Code aborts due to input/database inconsistency
- Mechanical aborts due to concurrency control solutions
- Frequent events, need instant recovery
System Failure (Fail stop)
- DBMS bug, OS fault, HW failure: volatile memory dies, durable memory survives
- Infrequent events, minute to recover
Media Failure (fail-stop)
- IO code bugs, disk HW failures: lose disk info
- Rare events, hours to recover from checkpoint & log
- Log is written directly to non-volatile storage
- Undo: rollback aborted transaction
- transaction failure or system failure
- Redo: repeat transaction on old DB data
- system failure or media failure
- Current DB: on disk + memory buffers
- Some changes in memory, lost during failures
- Materialized DB: the state of DB right after crash restart, and before applying log
- Some completed changes may not be visible because the buffer was lost
- Physical DB: on disk
- All blocks including out-of-date blocks & possibly incomplete data structures
Connecting views
- Changing non-volatile memory
- Modifications of current DB may cause writes to physical DB, which is not part of materialized DB yet
- After pointer structs are updated (i.e. propagated), then those modifications are part of materialized DB now
- Some DBs overwrite prior copies, so write is equivalent to propagate. But changes to materialized DB becomes non-atomic
- Log contains redundant information to cope with failure
- On-disk temporary WAL contains all info needed to transform materialized DB to current DB
- Memory pressure can push uncommitted dirty data to database
- in non-overwrite DB, such writes are forgotten when memory is lost
- in overwrite DB, this requires UNDO log records (STEAL) written before propagation
- Commit logically forces propagation (FORCE), but efficiency concerns cause DBs to avoid synchronous IO, instead writing REDO log records before transaction commit
- Log types:
- Physical logging: capture data values directly
- Logical logging: capture operations that give values
- State logging: capture full values
- Transition logging: capture value difference
- Page logging: capture full page values
- Record logging: capture only the records changing
- Tradeoff: simplicity/speed of recovery vs. density of log
- Checkpoint limits the amount of REDO processing required
- REDO starts at beginning of log, can be really slow
- FORCE propagation means no REDO (just UNDOs)
- Transaction consistent:
- Quiesce all transactions, propagate all dirty data, write log entry
- allows partial REDO to start here
- allows global UNDO stops here (o/w UNDO needs to find BEGIN for oldest incomplete transaction)
- Action consistent:
- Quiesce all transaction-caused actions, propagate all dirty data, write log entry
- allow partial REDO recovery processing to start here
- Fuzzy checkpoint:
- propagate only to log, to reduce REDO processing on restart
- What problem does this paper address?
- A summary of database recovery techniques and their pros and cons
- What is the author's main insight?
- A summary of database recovery techniques and their pros and cons
- What are the paper's key strengths?
- Comprehensive
- What are the paper's key weaknesses?
- A survey, so no?
- satisfies clock consistency condition: if event A's clock comes before event B's clock, then A happens either before or at the same time as B.
- vector clock satisfies strong clock consistency condition: if event A's clock comes before B's clock, then A happens strictly before B.
- A distributed system consists of a collection of distinct processes which are spatially separated, and communicate by exchanging messages.
- A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.
- In a distributed system, the relation "happen before" is only a partial ordering of the events in the system. It is impossible to say that one of two events occurred first.
- A single process: a set of events with an a priori total ordering
- A distributed system: a collection of processes
- Happened before "->": 1) same process, a before b; 2) a send, b receive; 3) a->b, b->c, a->c.
-
Logical clocks: define a clock
$C_i$ for each process$P_i$ which assigns a number$C_i$ to any event a in that process. The entire system of clocks is represented by the function C which assigns to any event b the number $C$, where $C=C_j$ if b is an event in process $P_j$. No assumption about the relationship between C and physical time (e.g. could be counters). - Clock condition: For any events a, b: if a->b, then $C$ < $C$
-
Implementation Rule:
- Each process $P_i$ increments $C_i$ between any two successive events
- a) If event a is the sending of a message m by process $P_i$, then the message m contains a timestamp $T_m=C_i$. b) Upon receiving a message m, process $P_j$ sets $C_j$ greater than or equal to its present value and greater than $T_m$
- Total Order: Order the events by the times at which they occur. To break ties, we use any arbitrary total ordering < of the processes.
- Distributed algorithm: each process independently follows these rules, and there is no central process.
- What problem does this paper address?
- A clock protocol to demonstrate partial ordering in a distributed computing environment
- What is the author's main insight?
- Separate logical clock from physical clock
- Each machine has its own logical clock, which is adjusted when messages are received
- Ensures causal relationships by updating receiver's logical clock
- What are the paper's key strengths?
- A solution to achieve partial ordering in distributed systems
- What are the paper's key weaknesses?
- Does not exhibit strong clock consistency condition: if A's clock comes before B, Lamport's clock can only ensure that A comes before or at the same time as B (instead of strictly before B).
- Distributed object-oriented database system:
- persistent storage and transaction support at server;
- applications (and caching) at clients
- This scheme provides serializability & external consistency
- Serializability: commit transactions can be totally ordered, called serialization order. The actual effect of running the transactions is the same as running them one at a time in that order.
- External consistency: the serialization order such that if S commits before T in real time, then S is ordered before T
- Low time and space overhead
- Application runs on client machines, caches happen on clients
- methods execute on client machines, using locally cached objects
- objects are fetched from servers when needed, then cached on clients
- Server stores persistent storage
- Server tracks the object caches at clients
- Server has cache of objects in their main memory, used to satisfy fetch requests
- Each object has an owner server, where the object resides
- Front end: the code that manages the cache on client
- Keep track of objects read and written by a transaction T at its client
Protocol Overview
-
When the client requests a commit, front end gathers info on T:
-
validation information - what objects are used, and how (read/write)
-
installation information - modified new objects
-
*collected only for mutable objects (concurrency control is not required for read-only objects)
-
-
front end sends these two info to a server that owns some of the used objects
-
The server becomes a coordinator for a 2-phase commit protocol
- (Observed delay by clients) Coordinator sends both info to participants. Participants validate. If succeeds, log the installation changes, send ok back. If fails, send reject back. If all ok, the coordinator logs the commit, sends commit to all participants.
- Coordinator sends commit to all participants. Each participant then installs new versions of its objects, log a commit, send ack back to the coordinator.
- *Read-only transactions don't need phase 2
-
After a read-write transaction that has installations, an invalidation message is sent to all other clients. The front end then evicts the outdated value. If the outdated value is already read, abort the transaction immediately.
-
Clients send ack back to server after installation. Then the server removes the info about the invalidated objects.
- Want serializability and external consistency
- Backward validation:
- A validating transaction T is checked against all transactions that have already validated successfully
- For validation, only require cached sets, invalid sets, validation queue on each server, small enough to put in mem
Global Serialization
- Use timestamp from real clocks
- Loosely synchronized clock:
- clocks at different nodes in the network may differ by at most a small skew (tens of milliseconds)
- Each transaction is assigned a timestamp = {local time, server ID}
- The coordinator sends these info for validation:
- timestamp
- objects read
- objects written
- client id
- Transaction S and T conflicts when one has modified an object that the other read or modified
- All successfully validated transactions are recorded in a validation queue,
- when a new validating transaction comes, the participant checks to detect conflicts and ensure serializability and external consistency
- if the new transaction fails the check (even against a prepared transaction), the participant always aborts the validating transaction
Check against later transaction
- Due to difference in clock, some later transaction T in real time may receive a timestamp earlier than S.
- If T read any object that S modified, T must be aborted.
- O/w, this violates external consistency
- Later-conflict check: later transaction T with an earlier timestamp than S
- T must not read S's modifications
- T must not modify any object that S read
- This kind of conflict is rare, only happens in a small clock skew window
Check against earlier transaction
- For each validated transaction S with earlier timestamp than T,
- If S reads object x and T modifies x: no check
- If S modifies object x and T reads x:
- if S is not committed, check fails
- if S is committed, check if T is read the most current version of x
- version check using an invalid set of objects for each client
Truncation
- Threshold timestamp:
- guaranteed to be greater than or equal to the timestamps of all transactions whose information has been removed form the VQ
- Validation record is retained for
- all uncommitted read-write transactions
- all transactions with timestamps above the threshold
- A transaction T timestamped below the threshold fails validation because info necessary for the later-conflict check has been discarded
- Threshold Interval
- local time + (msg_delay + clock skew)
- Log validation information (VQ and threshold) on stable storage
- for read-write transactions, the validation record can be logged along with the installation information, so little latency
- for read-only transaction, since no information is recorded initially, we maintain a stable threshold
- which is always later than the timestamp of any transaction
- only used to restore threshold on restart
- page-based scheme that does locking and callbacks at the page level
- object level locking and callback for pages that exhibit read-write sharing
- What problem does this paper address?
- How to do optimistic concurrency control efficiently?
- What is the author's main insight?
- Clients use cached objects for transaction read, use a front end to track which objects are read/written
- After receiving a commit request, the server conducts a 2 phase commit to validate the transaction
- Use an efficient validation algorithm that is both time and space efficient
- Do not incur extra logging latency for read-only operations, add a stable threshold for crash recovery
- What are the paper's key strengths?
- An optimistic concurrency control protocol that is efficient for low and moderate contention levels
- Low number of message sent compared to pessimistic protocols
- What are the paper's key weaknesses?
- Hard to scale for high contention levels and higher number of clients
Phase 0: Leader Selection
- When a potential leader detects no leader is present, it chooses a unique ballot ID, which is higher than any it has seen so far, and send to all processes
- When a process receives a ballot ID, it elects the highest it has seen so far
- If a candidate receives a majority of OK, it becomes the new leader
- *It is possible that more than 1 leader exists, but Paxos still ensures safety
Phase 1: Prepare
- The leader sends a new proposal number n, sends the request to all processes
- Acceptor responds with:
- a promise of not accepting a proposal numbered less than n
- the proposal with the highest number less than n that is accepted, if any
Phase 2: Accept
- After receiving a majority of responds, the leader sends a proposal of <n, v>, where v is either the value of the highest number accepted proposal, or a new value chosen by the leader
Consensus achieved when?
- Achieved when a majority of acceptors hear the proposed value and accept it, before responding any message
- What problem does this paper address?
- How to achieve provable async, failure-stop consensus
- What is the author's main insight?
- Ask acceptors to promise ignoring lower proposal numbers
- What are the paper's key strengths?
- first provable async, failure-stop consensus
- What are the paper's key weaknesses?
- complicated to learn or implement
- Paxos does its job, but it is hard to implement or learn
- Raft provides better understandability, easier for both implementation and education
- Typically implemented using a replicated log
- Each server stores a log containing the same commands in the same order
- Keeping the log consistent is the job of a consensus algorithm
- Consensus algo ensures every log eventually contains the same log, even if some servers fail
Consensus Algorithm Requirements
- Ensures safety (never returns an incorrect result) under any non-Byzantine environment
- Ensures availability as long as a majority of machines are operational and can communicate with each other and with clients
- Timing does not affect consistency: faulty clocks and extreme message delays can at most cause availability issues
- Proceed when a majority of servers have responded, minor slow servers should not affect performance
Phase 1: Leader Election
- Current term timeout, a new term starts.
- A follower increments its current term and transitions to candidate state
- A candidate sends out requests to vote. Followers vote for at most one leader, first-come-first-serve basis.
- After receiving a majority of votes, becomes the leader
- If receive msgs from another leader with at least as large term number, recognize the leader
Phase 2: Log Replication
- Client sends requests to only the leader
- The leader first append the log (without applying it yet), then send appendEntry requests to all followers
- After receiving majority of acks, apply the log and send applyEntry requests to all followers
- A command is logged with its term number and a unique incremental index
- For each AppendEntry request, the leader piggyback previous log command's term and index to ensure log consistency (induction-like mechanism)
Log Inconsistency b/w Leader and Follower
- Leader will remove all inconsistent log entries in the followers
- Add leader's log entries to all followers
- Ensures that the new leader has the most up-to-date log entires by election process:
- A candidate needs to provide its most current log in its requestToVote msg, which will be denied by a follower if the follower's log is more up-to-date
- What problem does this paper address?
- How to improve understandability of consensus algorithm in an async, non-Byzantine environment?
- What is the author's main insight?
- Separate consensus into smaller subproblems: leader election, log replication, safety, configuration changes
- What are the paper's key strengths?
- Arguably better understandability and implementation
- What are the paper's key weaknesses?
- does the same job as paxos
- Chubby is a lock service used in a loosely-coupled distributed system consisting of moderately large numbers of small machines connected by a high-speed network
- Primary goals are availability and reliability for a moderately large set of clients
- Secondary goals are throughput and storage capacity
- Chubby's interface is similar to a simple file system that performs whole-file reads and writes, with advisory locks, with notification of events
- Choose a lock service (instead of a library or service for consensus)
- Serve small-files to permit elected primaries to advertise themselves and their parameters, rather than build and maintain another service
- Clients and replicas of a replicated service may want to know hen the service's primary changes, so some notification mechanism is needed
- consistent client-side caching to reduce server load
- file system interface
Coarse-grained lock service
- Creating a lock service rather than a client library made it easier to integrate with existing applications
- Chubby runs five replicas in a cell; client systems that depend on Chubby can work with fewer replicas (not possible with a client library)
- Chubby is explicitly designed for coarse-grained locking use cases (e.g. electing a primary)
- coarse-grained locks impose far less load on the lock server
- allow many clients to be served by a modest number of lock servers with somewhat lower availability
-
Two main components:
- server
- client: a library that client apps link against
-
All communications are done through the client library
-
consists of a small set of servers (e.g. 5), replicas
-
A master, elected using Paxos
-
Master lease: replica's promise to not elect other masters within a short time period (e.g. 5s)
-
Replicas maintain a copy of a simple database
-
Only master initiates reads and writes to the database, other replicas copy updates from the master, through Paxos
-
Client finds the master using DNS
-
All communicates are from clients to the master, master then propagates writes to other replicas through paxos; reads are served by the master as master lease is not expired
-
if master fails, other replicas run leader election when master lease expires
-
server replacement is done if a server does not respond in hours, updates DNS, master's cell's member table
Files, directories, and handles
- Chubby exports a file system interface, similar to UNIX
- The name space contains files and directories, called nodes
- Any node can act as an advisory reader/writer lock
- Each nodes has meta-data, including ACLs included to control reading, writing, and changing ACL permissions
- Clients open nodes to obtain handles, which is similar to UNIX file descriptors
Caching
- Chubby clients cache file data and node meta-data to reduce read traffic
- cache is consistent and write-through
- Master keeps a list of what clients may be caching
- Cache is either consistent or an error
Events
- When a client creates a handle, clients may subscribe to a range of events on a file
- events are delivered asynchronously
- What problem does this paper address?
- A coarse-grained locking service (e.g primary election) that is based on distributed consensus algorithm
- What is the author's main insight?
- Emphasize on availability and reliability
- use distributed consensus protocol to achieve fault tolerance
- consistent client side caching
- notification of updates
- familiar file system interface
- small storage for meta-data
- What are the paper's key strengths?
- High availability
- Used to elect primary
- A good name service
- What are the paper's key weaknesses?
- What problem does this paper address?
- The gap between Paxos in literature and in implementation
- What is the author's main insight?
- Handle membership Management, disk corruption, master lease, snapshot
- What are the paper's key strengths?
- What are the paper's key weaknesses?
Assumptions:
- no assumptions about faulty behavior
- asynchronous (cannot assume synchrony): no bounds on delays
- solution:
- ensures safety without synchrony: guarantee no bad replies
- eventual liveness: will reply when DOS attack ends
- solution:
Faulty Clients
-
all operations by faulty clients are observed in a consistent way by non-faulty clients
-
enforce access control
-
Faulty clients cannot break invariants
Algorithm Properties
- Arbitrary replicated service
- can have complex operations
- mutable shared state
- Safety and liveness:
- system behaves as correct centralized service
- clients eventually receive replies to requests
- Assumptions:
- 3f+1 replicas with f potentially Byzantine faults
- strong cryptography
- for liveness: eventual time bounds
Algorithm Setup
- State machine replication:
- deterministic replicas, start in the same state
- replicas execute the same requests in the same order
- correct replicas produce the same result
- View
- succession of configurations
- Each view has a designated primary
- Primary:
- responsible for receiving requests from clients
- primary is determined by view number
- picks ordering
- Backup:
- responsible for responding messages from primary
- ensures primary behaves correctly
- certify correct ordering
- trigger view changes to replace faulty primary
- Replicas remember decisions in log
- Messages are authenticated
Algorithm Overview
- A client sends a request to the primary
- primary multicasts the request to the backups
- Replicas execute the request and send a reply to the client
- The client waits for f+1 replies from different replicas with the same result, which is the actual result
Quorums and Certificates
- A quorum has at least 2f+1 replicas so that any two quorums will have at least one intersect
- certificate: set with messages from a quorum
- algorithm steps are justified by certificates
Phase 1: Pre-prepare
- After receiving a request from a client, the primary assigns a sequence number n to request m in view v
- Primary multicasts pre-prepare message to all other replicas
Phase 2: Prepare
- After receiving pre-prepare message from the primary, a backup sends a multicast prepare message to all other replicas
- In total, each replica receives 1 pre-prepare msg from the primary and 2f prepare message from replicas, so 2f+1 in total -> a quorum so a certificates
- *Note: no two different messages have the same sequence and view number, because it requires a quorum
Phase 3: Commit
- After receiving a quorum in prepare stage, both primary and backups multicast commit message to all other replicas
- Each replica will have 2f+1 commit msgs -> a quorum so a certificate
- Send result back to the client
- *Note: request n is executed when having certificate, and requests less than n have been executed
- When primary fails, timeout is triggered at every replica
- Each replica will send a view-change message to all other replicas
- When the new primary receives 2f+1 view-change msgs (a quorum), it sends a new-view message with signatures of all 2f+1 view-change msgs
- The new primary will start receiving client requests
Liveness
- When the primary fails, timeout triggers view changes
Checkpoint
- A checkpoint is the state immediately after every (e.g. 100) requests
- A stable checkpoint is a checkpoint with proof
- proof is achieved when a replica receives 2f+1 checkpoint message from different replicas, each with a signature
- when a new stable checkpoint is achieved, all messages of previous sequence numbers can be deleted
- What problem does this paper address?
- A consensus protocol in an async, Byzantine environment
- What is the author's main insight?
- 3 phase commit
- 3f+1 replicas: 1 primary, 3f backups
- Safety no matter what
- Liveness achieved if there is a synchrony bound
- What are the paper's key strengths?
- Practical BFT protocol in an async environment
- What are the paper's key weaknesses?
- not scalable, more replicas mean worse performance
- Hardware development is faster than OS development frequency
- Consolidate machines
- Isolate performance, security, and configuration
- Stay flexible
- Cloud computing
- Disco/VMware/IBM: complete virtualization - runs unmodified OSs and applications
- Use software emulation to shadow system data structures
- Need hardware support
- Paravirtualization - change interface to improve VMM performance/simplicity
- Must change OS and some apps
- Xen: change OS but not applications - support full Application Binary Interface (ABI)
- Faster than a full VM
Overheads
- privileged executions need to be emulated by the monitor
- Device I/O needs to me intercepted and remapped by the monitor
- Additional memory costs to run multiple OSes, including code and data, file system buffer cache
Resource Management
- Monitor does not have OS-level information on resource management
Communication and Sharing
- Sharing and communication becomes difficult, such as disk, file, etc
- Disco runs multiple independent VMs on the same hardware by virtualizing all resources of the machine
- Each VM can run a standard OS that manages its virtualized resources independently of the rest of the system
Processors
- The virtual CPU of Disco provide the abstraction of a processor
- Disco emulate all instructions, the memory management unit, and the trap architecture of the processor
- Allow unmodified applications and existing OSes to run on the VM
- Disco support efficient access to some processor functions
- e.g. kernel operations such as enabling and disabling CPU interrupts can be performed using load and store instructions on special addresses, reducing overheads by trap emulation
Physical Memory
- Disco provides an abstraction of main memory with contiguous physical address space starting at address zero
- Disco uses dynamic page migration and replication to export a nearly uniform memory access time memory architecture to the software -> allows non-NUMA-aware OS to run
I/O Device
- Each VM is created with a specified set of I/O devices
- e.g. disk, network interface, periodic interrupt timer, clock, console
- Disco virtualize each I/O device
- Disco intercept all communications to and from I/O devices to translate or emulate the operation
- Virtual disks can be configured to support different sharing and persistency models
- Monitor virtualizes access to the networking device, each VM is assigned a different link-level address. Disco acts as a gateway that send and receive packets using the machine's network interface.
- Multi-threaded shared memory program
- Extra attention given to NUMA memory placement, cache-aware data structures, and interprocess communication pattern
- Disco code copied to each flash processor
- Communicate using shared memory
Virtual CPUs
- Disco emulates the execution of the virtual CPU by using direct execution on the real CPU
- Pros: most operations run at the same speed as they would on the raw hardware
- Challenge: how to detect and fast emulate privileged instructions (not safely exported) by the OS: TLB modification, direct access to the physical memory, I/O device
- For each vCPU, Disco keeps a data structure that contains all state of the vCPU when it is not scheduled on a real CPU: saved registers, privileged registers, TLB contents
- Disco runs in kernel mode, OS runs in supervisor mode, o/w user mode
- Disco emulates the operations that OS cannot run in supervisor mode:
- when a trap such as page fault, system call, or bus error occurs, the processor traps to the monitor, which emulates the effect of the trap by updating the privileged registers of the virtual processor and jumping to the virtual machine's trap vector
Virtual Physical Memory
- Disco adds a level of address translation, maintains physical-to-machine address mapping
- VMs use physical addresses that start at address zero, contiguous to the size of VM's memory
- Disco maps these physical addresses to machine addresses
- Disco performs this translation using software-reloaded TLB, where when an OS attempts to insert a virtual-to-physical TLB entry, Disco uses physical-to-machine mapping to translate it into virtual-to-machine TLB entry.
- For each TLB entry later on, there is no extra overhead
- To avoid flushing TLB on MMU context switches within VM, each entry in TLB tagged with address space identifier
- However, must flush the real TLB on VM switch
- Somewhat slower:
- More TLB misses: TLB used for all VM OSs, TLB must be flushed on VM switch
- Each TLB miss is more expensive be cause of trap emulation, privileged execution emulation, and remapping of physical-to-machine
- Optimization:
- add a second-level TLB to store more virtual-to-machine translations
NUMA Memory Management
-
Non-uniform Memory Access: each processor has access its local memory (fast, e.g. cache) and other processor's memory (slow) or shared memory (slow). So the access time to different memory are non-uniform
-
Manage allocation of real memory to VMs
-
Disco needs to allocate memory and schedule vCPUs such that cache misses can be satisfied by local memory, instead of suffering additional latency of a remote cache miss
-
Disco targets cache coherence done by hardware, so NUMA memory management is only an optimization that enhances data locality
-
Dynamic page migration and page replication:
- Pages heavily accessed by only one node is migrated to that node
- Read-shared are replicated to nodes heavily accessing them
- Write-shared are not moved, b/c remote access is still required anyway
- limit the number of times a page can be moved (limit overhead)
-
To migrate, first invalidate all TLB entries, then copy to local memory
-
To replicate, first downgrade all TLB entries to read-only, then copy and update TLB entries
![image-20210712121248502](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210712121248502.png)
Virtual I/O Devices
-
Main task: translate all I/O instructions from using physical memory addresses to machine memory addresses
-
Emulated all programmed I/O instructions
-
copy-on-write disk blocks - share OS pages:
- Track which blocks are already in memory
- If a block is already in memory, reuse it by marking all versions read-only, and use copy-on-write if they are modified
![image-20210712134035438](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210712134035438.png)
Virtual Network Interface
- Within a physical machine, zero-copy networking allows fake subnet
- sender and receiver can use the same buffer
- the monitor's networking device remaps data page from source's machine address to destination's
![image-20210712134631752](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210712134631752.png)
- Disco VMM hides NUMA-ness from non-NUMA aware OS
- Disco VMM is low effort (only 13K LoC)
- Moderate overhead due to virtualization
- 16% overhead for uniprocessor workloads
- System with 8 virtual machines can run some workloads 40% faster
- What problem does this paper address?
- Hardware iterates faster than software (OS), how to make OS compatible with existing and future hardware?
- What is the author's main insight?
- Add a layer between hardware and OS
- Virtualize all resources of machine
- What are the paper's key strengths?
- Consolidate machines
- isolate performance, security, and configuration
- Moderate overhead of 16%
- What are the paper's key weaknesses?
- Number of VMs is small
- Still moderate overhead
- Xen, a VMM that allows multiple OSs to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality
- Provides an idealized virtual machine abstraction with minimal porting effort
- Hosting ~100 VM machines with 2-5% performance overhead
- Aim to run industry standard applications and services
- Isolate performance from each other, given mutually untrusted users
- Support a variety of OSs
- Small performance overhead
Drawbacks of Full Virtualization
- Supervisor instructions must be handled by VMM for correct virtualization, increased complexity and cost
- It may be desirable that guest OS knows both real and virtual resources (e.g. time)
- Moderate overhead
x86 Virtualization
- MMU uses hardware page tables
- Some privileged instructions fail silently instead of fault
- VMWare fixed this using binary rewrite
- Xen by modifying OS to avoid them
Paravirtualization
- Require modification to the OS (but not apps)
- Improve performance, resource isolation
![image-20210712162043203](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210712162043203.png)
Memory Management
- x86 has hardware-managed TLB
- No tagging allowed in x86, address space switches require a complete TLB flush
- Each TLB miss is served by the processor by walking the page table structure in hardware
- All valid page translations for the current address space should be present in hardware-accessible page table
- Solution:
- guest OSs responsible for allocating and managing hardware page tables. Need Xen to validate entries before write; guest OS has read-only access
- Xen exists in the top of every address space, so avoiding TLB flushes when entering and leaving the hypervisor
CPU
- OS needs to run at a lower privilege level
- VMM in ring 0, OS in ring 1, Apps in ring 3
- Privileged instructions by guest OS are required to be validated and executed within Xen
- Exceptions (e.g. memory faults, software traps) are virtualized by a table describing the handler for each type of execution. The table is registered with Xen for validation.
- Two most frequent exceptions: system calls, page faults
- page fault must be delivered to Xen because it requires ring 0 to read the faulting address from register CR2
- improve performance of syscalls by allowing each guest OS to register a "fast" exception handler that is accessed directly by processor without requiring ring 0 indirect
- Safety is ensured by validating exception handlers when they are presented to Xen
- Each guest OS has a timer interface and is aware of both real and virtual time
Device I/O
- Xen exposes a set of device abstractions
- I/O data is transferred to and from each domain via Xen, using shared-memory, asynchronous buffer-descriptor rings
- high-performance
- passing buffer information vertically through the system
- validation check from Xen possible
- Hardware interrupts are replaced with a lightweight event system
- Hypervisor itself only provides basic control operations
- Run the VMM management at user level
- Domain0:
- manage the entire server
- given special access to control interface for platform management
- Ability to create and terminate other domains, control their associated scheduling parameters, physical memory allocations, access to physical disks and network devices
Control Transfer: Hypercalls and Events
- VM to Xen: synchronous hypercall
- software trap to Xen to perform privileged operations
- e.g. page-table updates
- Xen to VM: async event mechanism
- lightweight notification of important events
- e.g. domain-termination requests
Data Transfter: I/O Rings
- necessary to have a data transfer mechanism that allows data to be moved vertically through the system with as little overhead as possible
- Asynchronous I/O rings are used to transfer data between Xen and guest OSs
- zero-copy semantics
- A pair of consumer-producer pointers for each direction
- The structure is generic
CPU Scheduling
- Fair sharing scheduler: effective isolation between domains
- However, allows temporary violations of fair sharing to favor recently-woken domains
- Reduce wake-up latency to improve interactivity
- Fast event handling by domains
Virtual Address Translation
- Shadow page table is too costly: Xen needs to intercept and add physical -> machine translation for every change
- Xen only involve in page table updates (to prevent guest OSes from making unacceptable changes)
- register guest OS page tables directly with the MMU
- restrict guest OS to read-only access
- complete page table updates are passed to Xen, which will only need to validate before applying
- Validation:
- associate a type and reference count with each machine page frame
- Xen can ensure that only validated pages are used for the HW page tables
- To minimize the number of hypercalls, guest OS can batch updates before applying an entire batch with a single hypercall
Physical Memory
-
physical memory is reserved at domain creation time
- statically partitioned among domains
-
guest OS has a physical -> machine mapping, allowing contiguous physical address space
-
Xen has a table to map hardware to physical
Network
- Xen offers a set of Virtual Firewall/Router (VFR)
- Each guest OS has one or more virtual interface (VIF) that is assigned to a VFR
- the VFR both limit the guest OS and ensures correct incoming packet dispatch
- Exchange packet's page to a destination VM's free page on packet receipt
- no copying required
- Bandwidth is round robin
Disk
- Xen provide abstraction of Virtual Block Device (VBD)
- A translation table is maintained within Xen for each VBD
- entries are installed and managed by Domain0
- On receiving a disk request:
- Xen inspects the VBD identifier and offset
- produces the corresponding sector address and physical device
- Zero-copy data transfer done using DMA between the disk and pinned memory pages in the requesting domain
Time and timers
-
Times:
- Real time: time since machine boost
- Virtual time: time that only advances in the context of VM
- wall-clock time
-
Each guest OS can program timers for both real and virtual time
- What problem does this paper address?
- A VMM that has minimal performance overhead, generic architecture requirement, isolate performance, security, configuration?
- What is the author's main insight?
- Minimally alter guest OSes to make VMs simpler and higher performance
- Run admin at user level: Domain0
- I/O rings
- What are the paper's key strengths?
- Isolation of performance, security, and configuration
- Support hardware-managed TLB; no TLB entry tagging
- I/O rings support zero-copy communication
- negligible overhead
- What are the paper's key weaknesses?
- Communication between domains?
- Load balancing for long-lived jobs
- System maintenance: controlled maintenance window
- Fault tolerance: move job away from flaky (yet still alive) servers
- Cost efficiency: rearrange load
- VM migration does not involve residual dependency:
- where original machine needs to stay alive and network-accessible for the migrated process (e.g. open file descriptors, shared memory segments, local resources, syscalls, memory accesses)
- VM migration allows the original machine to shutdown once migration is complete, allowing maintenance
- VM migration allows migrating in-memory state:
- in-memory state can be migrated in a consistent and efficient fashion
- applies to kernel-internal state (e.g. TCL control block for an active connection) and application-level state, even if this is shared between multiple cooperating processes
- Migrate without requiring clients to reconnect (impossible for application-level restart and layer 7 redirection)
- Separate concerns between the user and the operator:
- The user does not need to give operator any OS-level access permissions
- The operator does not need to know what is happening in the VM, just migrate the entire OS and its attendant processes as a single unit
- Low downtime: during which services are not available
- Low total migration time: during which state on both machines are synchronized, may affect reliability
- Disruption to active service: does not unnecessarily disrupt active services through resource contention (e.g. CPU, bandwidth) with the migrating OS
- Focus on physical resources: memory, network, and disk
Memory Migration Options
- Push phase: source VM continues running while copying certain pages across the network to the destination VM. To ensure consistency, any modified pages need to be resent.
- Stop-and-copy phase: stop the source VM, copy pages to the destination VM, start destination VM
- Pull phase: the new VM executes, when a non-copied page is accessed, page fault happens across the network and copied in from source VM
Pre-Copy Migration
- A bounded iterative push phase
- all pages are copied in the first round
- for other rounds, only copy modified pages
- Bound the number of rounds based on writable working set (WWS) analysis
- A typically very short stop-and-copy phase
- Carefully control network and CPU resources used
Local Resources
- Network resource:
- maintain all open network connections
- a migrating VM will include all protocol state, and will carry IP address
- Usually connected in a single switched LAN, new VM just broadcast its new physical location
- Disk
- Network-attached storage (NAS), no storage migration considered
-
Pre-Migration: begin with an active VM on physical host A. Preselect a target host B where resources required is guaranteed
-
Reservation: A sends a request for B, reserving a VM container at B
-
Iterative Pre-Copy: first round copy all pages; consecutive rounds copy dirtied pages
- Can increase bandwidth used for later iterations to reduce time during which pages are dirtied
-
Stop-and-Copy: suspends A, redirect network traffic to B, copy all CPU state and inconsistent memory pages over. At the end of this stage, consistent copies at both A and B
-
Commitment: B notifies A that a complete OS image is received. A sends ack back. A can now abort, B becomes the primary host.
-
Activation: B is activated. Post-migration code attach device drivers to the new machine and broadcast moved IP address (update IP address -> MAC address translation).
- The largest overhead is coherently transferring virtual machine's memory image
- Two kinds of pages:
- Seldom or never be modified - good to pre-copy
- Often updated (Writable Working Set) - good to stop-and-copy
- Different programs have different WWS behavior, but most are generally amenable to live migration
Tracking WWS
- Xen inserts shadow pages under the guest OS, populated using guest OS's page tables
- If a guest OS tries to modify a page of memory, the resulting page fault is trapped by Xen, so tracked by the shadow pages
- Managed migration:
- move the OS without its participation
- can add some paravirtualization to stun processes that dirty pages too frequently or move unused pages out to reduce pages moved
- Self migration:
- OS participates in migration
- hard to get a consistent OS because OS is participating
- What problem does this paper address?
- How to migrate VM with minimized downtime, low total migration time, and no unnecessary disruption to active services?
- What is the author's main insight?
- Bounded iterative pre-copy, and then stop-and-copy
- Track Writable Working Set to reduce unnecessary copies
- What are the paper's key strengths?
- very low downtime (in tens of ms)
- low total migration time (in s)
- no unnecessary disruption
- once migration is complete, the original VM can shutdown
- What are the paper's key weaknesses?
- Does not address storage
- If some workload has very frequent dirty page percentage, turns out to be pure stop-and-copy
- Isolate and understand the overhead introduced by VM and containers, compared to non-virtualized Linux
- Identify the primary performance impact of each virtualization option
Motivation and Requirements for Cloud Virtualization
- Unix's shared global filesystem lacks configuration isolation
- Multiple applications can have conflicting system-wide configuration settings (e.g. shared library dependencies)
- Administrators and developers simplify deployment by installing each application in a separate OS copy, either a new machine or a new VM
- Customers want to get the performance they paid for, so provision fixed units of capacity with no oversubscription
Kernel Virtual Machine (KVM)
- KVM is a feature of Linux that allows Linux to act as a type 1 hypervisor
- run unmodified guest OS in a Linux process
- add hardware acceleration and paravirtual I/O to minimize virtualization overhead
Linux Container
- container-based virtualization modified an existing OS to provide extra isolation
- built on the kernel namespaces feature: allows creating separate instances of previously-global namespaces
- by adding a container ID to every process, add new access control to every syscall
- usually create a container that has no visibility or access to objects outside the container
- processes running in a container appear to run on a normal Linux system, yet they share underlying kernel with processes from other namespaces
- Can contain as large as a system (system container) or as little as a process (application container)
- Application container consumes less RAM than system container or VM because it does not contain redundant management processes
- Application containers do not have separate IP addresses
VM vs. Container
Kernel Virtual Machine (KVM) | Container (Docker) | |
---|---|---|
Resource Isolation | - VM has a static number of vCPUs and fixed amount of RAM - cannot be resized |
- use cgroups to group processes and manage aggregated resource consumption - can be resized - apps don't know their resource limit |
Management | - easy to use via management tools (e.g. libvirt) | - Several management tools are available, e.g. Docker (i.e. has layered filesystem image) |
Scheduling | - Two levels: VM and VMM | - Only one level (Linux OS) of resource allocation and scheduling |
Performance Consistency | - vCPU can be descheduled, RAM can be swapped out, without any notice | - Applications may not know its container's limit |
Security Isolation | - Restrict VM's communication with outside world: limited number of hypercalls or emulated devices (via hypervisor) | - Cannot see/access what is in other namespaces (container) - root within one namespace is not root in another namespace - Share IP address with other containers - non-namespace-aware syscalls can introduce accidental leakage |
Communication | - All communications through VMM, hard to communicate | - Easy to communicate among containers and between container and the host |
Overhead | - add overhead when sharing data between VMs or between guest and VMM - in the cloud, VM access storage through emulated block devices - hardware acceleration - paravirtual I/O devices |
- Startup time for container is way less than VM |
- VM: CPU performance loss for workloads that depend on hardware-specific optimizations
- VM: memory performance loss due to nested paging, which makes TLB misses more costly
- VM: disk/network I/O performance loss due to emulated I/O instruction
- What problem does this paper address?
- Compare and contrast between VM and Container.
- What is the author's main insight?
- VM has more performance overhead due to its lower-level of isolation
- Container is more performant, easier to deploy, higher elasticity
- VM is more secure
- What are the paper's key strengths?
- Demonstrates the workloads where VMs perform badly compared to containers
- What are the paper's key weaknesses?
- Compare and contrast, so N/A?
- Currently, cloud providers support mobile application's backend through datacenters, which is centralized.
- End-to-end communications require many hops, so high latencies and low bandwidth
- Cloudlet reduces the number of hops by locating close to a cellular base station
- Workload aimed by cloudlets:
- resource-intensive
- interactive
Similarity
- Need strong isolation between user-level applications
- Mechanism for authentication, access control
- Dynamic resource allocation for user-level applications
- Support a variety of OS, applications
- VM used for both cloud and cloudlet
Differences
- Rapid provisioning: cloudlet needs to be more agile, highly dynamic connection and setup
- VM Migration: when move from cloudlet to cloudlet, VM needs to be migrated, in a WAN fashion
- Stateful swift cloning of VMs
- State inherited up to the point of cloning
- Local modifications are not shared
- VM cloning makes an impromptu cluster
Cloudlet Customization
![image-20210714165926819](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210714165926819.png)
- Cloudlet preload base VM
- mobile device provides VM overlay to run its customized VM
- reduce setup time
- How to achieve persistent storage?
- Send local storage volume to cloudlet at the beginning and receive changes at end of computation?
- How to find and negotiate for computational resources?
- Security of physical cloudlet machines?
- What problem does this paper address?
- How to reduce latency and improve bandwidth for resource-intensive, interactive mobile workloads?
- What is the author's main insight?
- Use cloudlet that is close to cellular tower
- 1-hop latency
- use VM overlay to achieve fast VM launch
- What are the paper's key strengths?
- low latency, high bandwidth
- fast VM startup
- What are the paper's key weaknesses?
- security, persistent storage, cloudlet negotiation
- Cluster computing frameworks are used to deploy different applications
- Different frameworks are suitable for different applications, there is no framework that suits all
- Within each framework, there is fine-grained resource sharing that allows data locality, scalability, high utilization.
- However, there is no way to perform fine-grained resource sharing across frameworks as they are deployed independently
- Mesos is a platform to share a commodity cluster with multiple cluster computing frameworks, where each framework is the most appropriate one for each application
- Improve cluster utilization
- Applications can share access to large datasets that are costly to replicate at a per-application basis
Challenge
- How to build a thin layer that is scalable and efficient
- Each framework has different scheduling needs (due to programming model, communication pattern, task dependencies, data placement)
- Must scale to clusters of tens of thousands of nodes, hundreds of jobs, millions of tasks
- Need to be fault tolerant and highly available as all frameworks depend on it
Limitation of a centralized scheduler
- A centralized scheduler: takes input on framework requirements, resource availability, organizational policies, then computes a global schedule for all tasks
- Complexity: need sufficiently expressive API to satisfy all frameworks' requirements
- Comprehensiveness: if new requirement/policy comes out, need to include all
- Framework's scheduling: frameworks have sophisticated scheduling already, not feasible to port them
- Mesos provides a new abstraction called resource offer, which encapsulates a set of resources a framework can allocate on a cluster node
- Mesos decides how many resources to offer
- Frameworks decides which resources to accept and which tasks to perform
- Properties:
- Not globally optimal, but works very well;
- simple and efficient to implement
- Scalable and fault tolerant
- Design philosophy is to design a minimal interface that:
- enables efficient resource sharing across frameworks
- push task scheduling and execution to the framework
- Pros: allow frameworks to implement their own approaches, keeps Mesos simple and minimize the rate of change required
- Mesos consists of a master node that manages slave daemons on each worker node, and frameworks run tasks on worker nodes
Resource Allocation
- The master implements fine-grained sharing across frameworks using resource offers
- each resource offer is a list of free resources on multiple slaves
- the master determines how many resources to offer each framework, according to an organizational policy (e.g. fair sharing)
- Mesos let organizations define their own inter-framework allocation policies
- Can kill tasks if they are too long
- Each framework on Mesos consists of:
- a scheduler, registers with master to be offered resources,
- selects which of the offered resources to use, pass Mesos a description of the tasks it wants to launch.
- Reject if resource not wanted.
- allow filters to optimize
- delay scheduling to acquire nodes storing their data
- Can indicate interests to have more resources
- an executor process on each slave node, used to run the framework's tasks
- a scheduler, registers with master to be offered resources,
Isolation
- Use OS Container to isolate performance, resources
Fault Tolerance
- Master hold soft state only: active slaves, active frameworks, running tasks
- sufficient to compute how many resources each framework is using and run the allocation policy
- A new master can completely reconstruct master's internal state from information held by the slaves and framework schedulers
- Multiple masters using a hot-standby configuration using ZooKeeper
- Performs better with elastic framework, homogeneous task duration, and prefer all nodes equally
- What problem does this paper address?
- How to enable fine-grained resource sharing across cluster computing frameworks?
- What is the author's main insight?
- Provide a thin layer between frameworks and cluster nodes
- Mesos provides resource offer
- Frameworks are responsible for accept/reject offers, schedule tasks, and execute tasks
- What are the paper's key strengths?
- Increase resource utilization
- Applications can share large datasets on the cluster instead of replicating them
- Can satisfy a diverse set of frameworks
- high scalability, fault tolerance, and high availability
- What are the paper's key weaknesses?
- Not globally optimal scheduling
Borg
- Manage resource sharing between long-running latency-sensitive services and batch jobs
Omega
- Improve the software engineering of Borg
- Omege stores the state of the cluster in a centralized Paxos-based transaction oriented store
- Use optimistic concurrency control to handle the occasional conflicts
- Decouple Borg's functionality into separate components that act as peers instead of a monolithic centralized master
- resource isolation
- cannot provide isolation for resources that OS kernel doesn't manage (e.g. level 3 cache, memory bandwidth)
- Modern container provides image (files that make up the application that runs inside the container)
- a hermetic container image that encapsulates almost all of an application's dependencies into a package that can be deployed into a container
- Only local external dependencies will be Linux kernel system-call interface
- Container transforms data center from machine-oriented to application-oriented
- Container abstracts away machine and OS from the application developer
- improves application development
- allow faster development of hardware
- Monitor/Log application instead of machine
- Easier to build, manage, and debug applications
Design
- Outermost container is called a pod, always run an application container inside of a top-level pod
- Each pod can hold an instance of a complex application, where each part of the application sits in one of the pod's child containers
- Choreography: achieving a desired emergent behavior by combining the effects of separate, autonomous entities that collaborate
- in contrast to a centralized orchestration system, which may be easier to construct at first, by become brittle and rigid over time
- What problem does this paper address?
- How to provide virtualization with efficient resource allocation, application-oriented approach, and auto-scaling for long-running jobs and batch jobs altogether?
- What are the paper's key strengths?
- What are the paper's key weaknesses?
- row store: most databases are record-oriented storage systems, where the attributes of a record are place contiguously in storage
- a single disk write suffices to push all of the fields of a single record out to disk
- write-optimized, high write performance
- insert or delete record in one physical write
- Effective for on-line transaction processing (OLTP), where
- transactions involve small numbers of records
- Frequent updates
- Many users
- Fast response time
- Easy to add new record, but might read unnecessary data (wasted memory and I/O bandwidth)
- column store: the values for each single column are stored contiguously
- Read-optimized, oriented toward ad-hoc querying of large amounts of data
- Effective for On-Line Analytical Processing (OLAP), where
- transactions involve large numbers of records
- Frequent ad-hoc queries and infrequent updates
- A few decision-making users
- Fast response times
- Read-mostly
- Intuition: Only read relevant columns
- A DBMS only reads the values of columns required for processing a given query, avoiding bring into memory irrelevant attributes
- In data warehouse environment where typical queries involve aggregates performed over large numbers of data items, a column store has sizeable performance advantage
- Cons:
- tuple write needs multiple seeks
- Projection:
- C-store physically stores a collection of columns, each sorted on some attribute
- A group of columns stored on the same attribute is a projection; the same column may exist in multiple projections
- Tuple mover:
- A Writable Store (WS) component that provides efficient write operations
- A Read-optimized Store (RS) component that provides efficient bulk read
- A tuple mover that periodically moves new data from WS to RS
- Consistency:
- snapshot isolation
- Column-oriented optimizer and executor
Trade CPU cycles for Disk Bandwidth
- Code data elements into a more compact form
- Densepack values in storage (compression)
- What problem does this paper address?
- How to support read-optimized workload?
- What is the author's main insight?
- Use column store instead of row store
- Do not read unused data
- Trade CPU cycles for disk bandwidth, operate on compressed data
- What are the paper's key strengths?
- Faster for read-only workloads
- Both WS and RS to support efficient read and write workloads
- Use snapshot isolation to provide lock-free fast queries
- What are the paper's key weaknesses?
- Tuple mover may impact performance when the database is constantly at high workload
- Many DBMS perform well and efficiently only after tuned by a DBA
- DBA decides:
- Which indices to build?
- On which data parts?
- When to build?
- Timeline:
- Sample workload
- Analyze performance
- Prepare estimated physical design
- Queries
- DBA decides:
- Tuning by DBA is a complex and time-consuming process
- Hard to do for dynamic workload or a very large database
- Design new auto-tuning kernels
- continuous on-the-fly physical reorganization
- Partial, incremental, adaptive indexing
- For modern column stores
- Each query is used as advice on how data should be physically stored
- triggers physical re-organization of the database
- The first time a range query is posed on an attribute A, a cracking DBMS makes a copy of column A, and call it cracker column A
- A cracker column is continuously physically reorganized based on queries that need to touch attribute A, such that the result is in a contiguous space
- For each cracker column, there is a cracker index
![image-20210721162220068](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210721162220068.png)
Cracker Select Operator
- Searches the cracker index
- Physically re-organize the pieces found
- update the cracker index
- Return a slice of the cracker column as a result
- Faster because analyzes less data (even though more steps)
- What problem does this paper address?
- How to physically store tuples such that no DBA is required to support fast queries?
- What is the author's main insight?
- Continuously physically re-organize database based on queries
- Use queries as evidence on where to store tuples
- What are the paper's key strengths?
- Does not require DBA tuning for efficient DBMS performance
- Can adapt to dynamic workload or large database
- What are the paper's key weaknesses?
- DBA may have a better knowledge
- May not converge to the efficient layout
- A fundamental problem of peer-to-peer applications is to efficiently locate the node that stores a particular item, given that nodes are dynamically changing
Overview
-
Chord is a distributed lookup protocol that does one thing: given a key, it maps the key onto a node
- Chord adapts efficient as nodes join and leave the system
- Can answer queries even if nodes are continuously changing
- scalable: communication cost and the state maintained by each node scales logarithmically with the number of Chord nodes
-
Chord is a variant of consistent hashing:
- Load balancing as each node receives roughly the same number of keys
- Little key movement when nodes join and leave the system
-
In steady state:
- Chord node only needs information on O(logN) nodes
- a node resolves all lookups by communicating with O(logN) nodes
-
When node join and leave:
- Chord maintain its routing info as nodes join and leave
- high probability that each event results in no more than O(log^2(N)) messages
-
Chord is simple, provable correct, provable performant
-
A node only needs one piece of information to ensure correctness, although slow
- Load Balancing: natural load balancing provided by consistent hashing, keys are evenly spread over nodes
- Decentralization: all nodes run the same software, no node is more important than others. Better robustness. Appropriate for loosely-organized peer-to-peer network
- Scalability: Lookup cost (space, communication) is logarithmically to the number of nodes
- Availability: highly available, allow continuous changes to internal tables to reflect node joins and leaves. Except for major network failure, the responsible node can always be found
- Flexible Naming: no naming structure is required. A flat namespace.
Consistent Hashing
- Key and Node ID are generated by hashing the key and the node IP
- ID are ordered in an identifier circle module 2^m
- Each key is assigned to the node whose ID is the same or follows the key ID
- Assume the key load on a node may exceed by at most O(logN)
Finger Table
- Each node keeps a routing table with (at most) m entries, where m is the length of ID
- The ith entry at node n contains the ID of the first node that succeeds by at least 2^(i-1) on the identifier circle
- A finger table entry contains the node ID and its IP+port
- Two characteristics:
- One node knows the nodes following it better than further away ones
- Information on each node is not enough to resolve most keys
- To search for a key, find in finger table for the node that is greater than its own ID and smaller than the key ID, ask it
- If none, then return successor
Node Join
- Two invariants for correctness:
- Each node's successor is correctly maintained
- For each k, successor(k) holds k
- Correct finger tables is for fast lookup
- Each node maintains a predecessor pointer, which points to the node immediately before it
- Three tasks when a node n joins the network:
- Initialize the predecessor and finger table of node n
- Update the predecessor and finger table of existing nodes to reflect addition of n
- Notify the higher layer software so that it can transfer values associated with keys that node n is now responsible for
- Assume new node n knows an existing node n' through external mechanism
- Initialize fingers and predecessor:
- Node n knows its predecessor and fingers by asking n' to look them up
- For each of fingers, n' performs a lookup. An optimization is to check if ith finger is the same as (i+1)th finger
- Overall time becomes O(log^2(N))
- Update fingers of existing nodes:
- Node n will be the ith finger of node p if and only if
- p precedes n by at least 2^(i-1)
- the ith finger of node p succeeds n
- Algo starts with the ith finger of node n, and then continues to walk in the counter-clock-wise direction on the id circle
- O(log^2(N))
- Node n will be the ith finger of node p if and only if
- Transfer keys:
- Contact the node immediately follows n only
- move the data associated with keys to n
Stabilization
- To ensure correctness, stabilization protocol keeps nodes' successor pointers up to date
- Those successor pointers are then used to verify and correct finger table entries
- Every node runs stabilize periodically, which asks n's successor for its predecessor p, and decide if p should be n's successor instead
Failures
- Each node keeps a successor-list that tracks r nearest successors on the Chord ring
- If the node fails, contact the next successor node on the list
- For node far away, use finger table to contact a closer node
Replication
- Use successor-list to store replicas on nodes succeeding n
- What problem does this paper address?
- A simple, provable correct, provable performant distributed lookup protocol for decentralized applications?
- What is the author's main insight?
- Consistent hashing
- Finger table
- Separate performance and correctness
- What are the paper's key strengths?
- simple, provable correctness, provable performance
- Separate correctness and performance
- What are the paper's key weaknesses?
- Latency vs. memory: finger table -> more memory and less latency
- fixed ID size: could be problematic if variance in network size
- Dynamo is a highly available data storage
- A simple key/value interface
- highly available with a clearly defined consistency window, efficient in its resource usage, a simple scale out scheme
- Data is partitioned and replicated using consistent hashing
- Consistency is facilitated by object versioning
- Consistency among replicas is maintained by a quorum-like technique and a decentralized replica synchronization protocol
Query Model
- Simple read and write operations to a data item that is uniquely defined by a key
- state is stored as binary objects identified by unique keys
- No range operations
- No relational schema
- Target applications that store small objects
ACID Properties:
- ACID (Atomicity, Consistency, Isolation, Durability)
- Target applications that operate with weaker consistency, but high availability
- Dynamo does not provide isolation guarantees
- Only single key updates
Efficiency
- System needs to run on commodity hardware
- Storage system must meet stringent SLAs
- Tradeoffs are performance, availability, cost efficiency, and durability
Other Assumptions:
- No security related requirements
- Needs to be highly scalable - hundreds of storage hosts
Consistency vs. Availability
- There is a tradeoff between consistency and availability when network failure happens
- Traditional systems ensures consistency, where data is made unavailable until it is absolutely certain it is correct
- Optimistic Replication Technique:
- Increase availability by propagating changes to replicas in the background, and concurrent, disconnected work is tolerated
- challenge: leads to conflicting changes, which need to be detected and resolved
- Question: when and who to resolve conflicts?
- Dynamo is designed to be an eventually consistent data store: all updates will reach all replicas eventually
- When:
- Should conflicts be resolved during reads or writes?
- If during write: writes could be rejected when it cannot reach to all replicas, hurts availability; simpler reads
- If during read: highly available for writes; complex reads
- Who:
- Should conflicts be resolved by application or data store?
- if Data Store: does not have information from application level, can only use simple techniques such as "last write wins"
- if application: can have higher-level resolve techniques based on application need
Incremental Scalability
- Dynamo should be able to increment by one host at a time with minimal impact on the system operator and the system itself
Symmetry:
- All hosts should have the same responsibilities as their peers; no distinguished node
- This reduces the cost of provisioning and maintenance
Decentralization
- In addition to symmetry, the design should favor decentralized peer-to-peer techniques instead of centralized control
- System is simpler, more scalable, and more available
- Centralized control may result in outages
Heterogeneity
- The system should exploit the heterogeneity of individual hosts
- Work distribution should leverage the capability of hosts
- Core distributed system techniques: partitioning, replication, versioning, membership, failure handling, scaling
System Interface
- get(key): returns the object associated with the key or a list of objects, each with conflicting versions and a context
- put(key, object, context): uses key to determine where to place replicas of the object. Context includes system metadata on the object, version of the object, etc.
- Context is stored along with object, so the system can verify that the put caller has validity
- Dynamo treats both key and object as an opaque array of bytes
- Hash of key is used to determine which nodes are responsible for storing/serving the key
Partitioning
- Challenges of basic consistent hashing:
- Random positioning of nodes may result in non-uniform data and load distribution
- Does not account for heterogeneity in nodes (good nodes should take more load)
- Consistent hashing with virtual node:
- each physical node can take multiple virtual nodes
- advantages:
- when a node leaves, its workload is evenly distributed to others
- when a node joins, it takes roughly equivalent workload as others
- Good nodes can take more virtual nodes (heterogeneity)
Replication
- The coordinator stores the key it is responsible for, and replicate it to N successor nodes
- Preference list: the list of distinct physical nodes responsible for storing a particular key
Data Versioning
- Dynamo provides eventual consistency, where updates can be propagated to all replicas asynchronously
- a subsequent get() may not return the most recent put()
- Each modification is treated as a new and immutable version of the data
- Allows multiple versions of the same data to be present
- Version branching:
- failures + concurrent updates: conflicting versions may happen
- need client to merge and collapse multiple branches
- Applications need to explicitly acknowledge multiple versions
- Vector clock:
- Use vector clock to capture relationships between versions (parallel vs. causal)
- when a client updates, it must specify which version it is updating using the context parameter
- If multiple versions of an object exist, and cannot be syntactically reconciled, get() will return all existing objects with their context
- client is responsible for semantically reconcile them
- To limit the amount of (node, counter) pairs, remove the oldest pair when a threshold is reached
Execution of put() and get() requests
- Client either sends requests to 1) a load balancer that routes to a random node; 2) to the coordinator (lower latency)
- The coordinator is the first of top N nodes in the preference list
- Consistency protocol R and W:
- R: the minimum number of nodes that must participate in a get() request
- W: the minimum number of nodes that must participate in a put() request
- If R + W > N, a quorum-like system
- total latency is the latency of the slowest node
- put():
- coordinator stores locally
- coord sends to N highest-ranked reachable nodes
- Wait for W-1 responses
- return
- get():
- coordinator sends to all N highest-ranked reachable nodes
- wait for R responses
- return (send all conflicting versions)
Handling Failures: Hinted Handoff
- Hinted handoff:
- When a node (e.g. A) is unavailable during a write, a temporary node (e.g. D) will receive the replica. This ensures availability and durability guarantees.
- The hinted replica is stored on a separate local storage, and knows it is supposed to be on A
- When A comes back, D sends the replica to A, and delete its own copy
- To ensure highest availability, set W = 1
- To have better durability, set greater W value
- Tolerate datacenter outage by setting preference list to include multiple datacenter nodes
Handling Permanent Failures: Replica synchronization
- To handle threats to durability, Dynamo implements a replica synchronization protocol
- Use one Merkle tree per virtual node's key range, each leaf is the hash of a key's value
- Two hosting nodes can check if they have the same replicas by exchanging and comparing the root hash
- if different, recursively does it to its children, eventually find the key differences
- Merkle tree advantages:
- each branch can be checked individually without requesting other branches
- reduce the data exchanged
Membership and Failure detection
- Assume outages are mostly temporary, use explicit membership addition or removal
- The node that receives the request to add/remove stores it to persistent storage
- Nodes periodically exchange, wants eventual consistency on view:
- membership history information
- mappings between virtual nodes and key ranges (allows direct forwarding of requests)
- Purely local notion of failure detection:
- a node sees another node as temporary down if not responding messages
- repeatedly check unavailable nodes to see when it becomes alive
- What problem does this paper address?
- How to design a highly available decentralized data storage system?
- What is the author's main insight? What are the paper's key strengths?
Problem | Solution | Advantages |
---|---|---|
Partitioning | Consistent hashing | Incremental scalability |
Highly available for writes | data versioning with vector clock | always writable, client resolves semantic conflicts |
Handling temporary failures | sloppy quorum and hinted handoff | high availability and durability during node failures |
Recovering from permanent failures | replica synchronization through merkle tree | synchronize in the background with minimal data exchanges |
Membership and failure detection | gossip-based protocol | avoid a central identity to manage membership and failures |
- What are the paper's key weaknesses?
- No security aspects?
- Tapestry is a P2P overlay network that provides high-performance, scalable, and location-independent routing of messages to close-by endpoints, using only localized resources.
- Core is a Decentralized Object Location and Routing (DOLR)
- Aim for efficiency:
- minimize message latency
- maximizing message throughput
- Tapestry exploits locality in routing messages to mobile endpoints such as object replicas
- Use adaptive algorithms with soft state to maintain fault tolerance under changing node membership and network faults
- Tapestry virtualizes resources, objects are known by name instead of locations
Routing Mesh
- Each node stores a neighbor map
- Each level stores neighbors that match a prefix up to a certain position in the ID
- Construct locally optimal routing tables from initialization and maintains them in order to reduce routing stretch
- Each ID is mapped to a live node called the root
- Each message is aiming to reach the ID
- There are multiple hops for a message to find the root node:
- At each hop, a message is progressively routed closer to G by incremental prefix routing
- Each neighbour map has multiple levels, each level contains links to nodes matching up to a certain digit position in the ID
- e.g. level 1 has links to nodes that have nothing in common, level 2 has first digit in common, etc.
- Routing takes O(log_B(N)), N is the namespace size, B is the ID base (e.g. hex=16)
Object Publication and Location
- Participants in the network can publish objects by periodically routing a publish message toward the root node
- Each node on the path store the pointer mapping the object
- Objects are located by routing a message towards the root of the object
- Each node along the path checks the mapping and redirects the request appropriately
- What problem does this paper address?
- How to provide efficient, location-aware distributed routing?
- What is the author's main insight?
- Routing table with neighbours that have x digits of prefix.
- Every hop is progressively closer to the root
- locality-aware routing
- What are the paper's key strengths?
- locality-aware
- efficient
- adapt to lower-level network failures
- What are the paper's key weaknesses?
- Algo is more complicated than Chord etc.
- infrastructure is untrusted except in aggregate
- infrastructure is constantly changing without notices
- A data object is similar to a file in a traditional file system
- Data objects are ordered sequences of read-only versions
- Each version:
- contains metadata, data, reference to previous versions
- stored in a data structure similar to a B-tree, where a block reference each child by a cryptographically-secure hash of child's content (i.e. Block GUID)
- Version GUID is the root block's BGUID
- copy on write
Application-specific Consistency
- Updates are applied atomically and are represented as an array of potential actions, each guarded by a predicate
- Actions could be replacing bytes in a file, appending new data, etc
- Predicates could be checking the latest version number of the object
- Allow application to specify predicates required
- Not support explicit locks or leases, instead relying on update model
Virtualization with Tapestry
- Resources are virtual, where they are not tied to a specific physical instance, can be moved anytime
- A virtual resource is named by a GUID, contains the state required to provide some service
- Use Tapestry to locate resources, manage physical nodes, route messages
Replication and Consistency
- Data blocks are read only, so they can be replicated without worrying about consistency issues
- Mapping from a data object (AGUID) to its latest version (VGUID) needs to be updated in a consistent manner:
- each data object has a primary replica, which is responsible for serializing and atomically applying updates with a BFT certificate within its inner ring
Archival Storage
- Erasure code: any m of n fragments can reconstruct the message
- achieve higher fault tolerance with the same amount of storage overhead
- After an update, all data blocks are erasure coded, all encoded segments are distributed to nodes in the system using a deterministic function based on its fragment number and BGUID
- To read a block, a host collects m fragments from different nodes, and decode
Caching
- For each read, first query Tapestry to look for a cached copy of the data block
- After constructing a data block from fragments, publish cached copy to Tapestry
![image-20210726223931040](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210726223931040.png)
- What problem does this paper address?
- How to build a decentralized storage system where infrastructure is untrusted unless in groups?
- What is the author's main insight?
- Use Tapestry to do the routing and resource management
- Use primary replica inner ring and BFT signature threshold to ensure inner ring is BFT
- Use erasure encoding to achieve high durability
- Use caching to amortize frequent block reads
- Data blocks are read-only, simplifying replication and consistency
- What are the paper's key strengths?
- BFT decentralized storage system
- Efficient read by utilizing caching
- High durability using erasure coding
- What are the paper's key weaknesses?
- Write is expensive due to BFT protocol, erasure coding, dissemination tree, etc.
- Component failures are norm instead of exception. The system needs constant monitor, error detection, failure tolerance, automatic recovery
- Modest amount of large files (multi-GB)
- The majority of write operations is appending. Rare overwriting existing data, no random write.
- two kinds of reads: large sequential reads and small random reads
- Support concurrent appends to the same file, need atomicity with minimal synchronization overhead
- High sustained bandwidth is more important than low latency
- Traditional operations: create, open, close, read, write
- Special operations: snapshot and record append
- snapshot: creates a copy of a file or a directory tree at low cost
- record append: allows multiple clients to concurrently append to a file while guaranteeing atomicity
- A single master, multiple chunkservers, multiple clients
- Master: maintains all metadata in the system, including namespace, access control information, mapping from files to chunks, locations of chunks. Controls system-wide activities such as chunk lease management, garbage collection, chunk migration
- Chunkserver: store data chunks
- Client: asks master for specific chunks' locations, contact chunkserver directly for data chunks and byte range within the data truncks
- No caching on client or chunkserver
- client: no caching since mostly streaming read or working set too large to cache
- chunkserver: since local files, so Linux's buffer cache already caches
- Simplifies system design and allows sophisticated chunk placement and replication decisions
- Minimize master's involvement in read and writes to prevent bottleneck
- Master only handles metadata-related requests, no data transfers
- After receiving locations of requested chunk index, the client sends a request (chunk index + byte offset) to the closest replica for actual data
- Large chunks: size of 64MB
- Advantages:
- Reduce the number of communications required from client to master for chunk locations, especially for sequential read and write large files
- Reduce metadata size stored on master, can be stored in memory
- Disadvantages:
- internal fragmentation: can be mitigated with lazy allocation
- For a small file that consists of only a small number of chunks, chunkservers storing it can be hot spot. Mitigated with a higher replication factor
- Master stores three types of metadata:
- File and chunk namespaces (persisted by logging)
- File-to-chunk mapping (persisted by logging)
- chunk-to-chunkserver mapping
- Log mutations to operation log on the master's local disk
- Operation log is replicated to remote machines for fault tolerance
- Does not store chunk-to-chunkserver mapping, instead the master asks chunkservers for chunk information at startup
In-Memory Data Structure
- Metadata is stored in memory -> fast master operations
- In-memory data structure is compact:
- Less than 64Bytes per file (using filename prefix compression)
- Less than 64Bytes per 64MB chunk (chunk location + chunk size)
- Master periodically scans its entire state to:
- garbage chunk collection
- re-replication after chunkserver failure
- chunk migration to load balance
Chunk Location
- Master gains chunk location information by asking all chunkservers at startup
- Then the master is up-to-date since it is managing chunk placement and monitor chunkserver status with HeartBeat msgs
- The reason why not persisting chunk locations is that the chunkserver has the final word on what it has (e.g. errors may corrupt disks)
Operation Log
- Operation log contains the historical records of critical metadata changes
- Changes to metadata should not be seen by the clients before they are persisted to both master and remote disks
- Checkpointing is used to enable faster recovery
- internal state is structured such that checkpointing doesn't delay new changes
- In particular, master starts a new log file and creates the new checkpoint on a new thread
Metadata Mutations
- File namespace changes are solely handled by the master, so atomic
- use locking to guarantee atomicity and correctness
- use operation log to determine an order
File Region Mutation Result Types
- Defined: consistent + clients will see mutation writes in its entirety
- defined when a mutation succeeds without interference from concurrent writers
- Consistent (not defined): all clients will see the same data, regardless of which replicas; it may not reflect what any one mutation has written; consists of mingled fragments from multiple mutations
- consistent but not defined when concurrent successful mutations
- inconsistent (not defined): different clients may see different data at different times
Data Mutation
- Write:
- data to be written at an application-specified file offset
- Record Append:
- data (i.e. a record) is appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing
- The offset is returned to the client, is the beginning of a defined region that contains the record
- GFS may insert padding or record duplicates in between
- After a sequence of successful mutations, the mutated file region is guaranteed to be defined, and contain the data written by the last mutation. Achieved by:
- Applying mutations to a chunk in the same order on all its replicas
- using chunk version numbers to detect replicas that are stale due to chunkserver downtime
- Stale replicas will never involve in mutations or given to clients, they are garbage collected
- Stale client caches may allow clients to read from stale chunks, but not serious damage as it mostly returns a premature end of chunk rather than outdated data. The reader gets fresh data when it retries and contacts the master
- Data corruption is detected using checksumming
- Once a failure occurs, data is re-replicated from valid replicas as soon as possible
Implications for Applications
- Checkpoints may include application-level checksums to check how much has been successfully written
- Readers verify and process only the file region up to the last checkpoint (defined state)
- Appending is far more efficient and resilient to application failures than random writes
- Checkpointing allows writers to restart incrementally
- Concurrent record appends:
- record append-at-least-once
- may have padding and duplicates between records, rare record duplicates
- Reader can identify and discard extra padding and record fragments using checksums
- To remove duplicate records, filter using unique ID in the records
- Each mutation is performed on all the chunk's replicas
- Lease:
- the master grants a chunk lease to a replica (the primary)
- The primary picks a serial order for all mutations to the chunk
- All replicas follow this order when applying mutations
- Lease has a timeout, but can be extended indefinitely
- Lease grant and request are piggybacked on HeartBeat msgs
- Lease minimizes management overhead of the master
Control Flow Steps:
- Client asks master for primary and secondary replica locations
- Client pushes data to all replicas, wait for all acks
- Client sends primary a request to write
- Primary serializes mutations, apply to local
- Primary sends the same order to all secondaries, wait for all acks
- Primary return to client as success
- Control and data flows are separated
- Data is pushed linearly along a chain (instead of a tree) of chunkservers in a pipelined fashion
- avoid network bottlenecks
- avoid high-latency links
- fully utilize network bandwidth
- minimize latency
- Push to the machine closest to you, starts immediately after receiving some data
- Client only specifies the data, GFS appends it to the file at least once atomically at an offset determined by the GFS
- Add additional logic to the write control flow:
- after client send record append to the primary, the primary checks if appending the record will exceed the maximum chunk size
- if no, append at the current end of file. Done.
- if yes, pad the current chunk, create a new chunk, and ask the client to try again
- after client send record append to the primary, the primary checks if appending the record will exceed the maximum chunk size
- After successful record append,
- does not guarantee all replicas are bytewise identical
- guarantee that data is written at least once as an atomic unit
- the data has to be written to the same offset on all replicas of some chunk
- all replicas are at least as long as the end of record
- any future record will be assigned a higher offset
- Consistency guarantee:
- The region in which a successful record append operations is defined
- Advantages:
- No synchronization (e.g. distributed lock manager) is needed
- Use copy on write
- When snapshot,
- Need to revoke all leases or wait for them to expire. Ensure subsequent writes need to contact master for new lease holder
- Log the snapshot operation to disk
- Apply to in-memory state by duplicating the metadata for the source file
- The newly created snapshot files reference to the original chunks
- When client first writes,
- master picks a new chunk and asks all replicas to replicate locally, change pointer to the new chunk, then write
- Allow multiple operations to be active
- Use locks over regions of the namespace to ensure serialization
- GFS represents its namespace as a lookup table mapping full pathnames to metadata
- With prefix compression, can fit in memory
- Each node in the namespace tree has a lock
- Acquire read lock on directory if modifying files in it
- Acquire write/read lock on files as needed
- Locks are allocated lazily (create lock when acquired) to reduce storage overhead
- Multi-level distribution: machine level, rack level
- maximize data reliability and availability
- maximize network bandwidth utilization
How to choose a chunkserver to add replica
- A chunkserver that has below-average disk space utilization
- A chunkserver that has a limited number of "recent" creations
- Replicate across racks
- Lazy garbage collection:
- when a file is deleted, the master logs the deletion immediately, but the file is just renamed to a hidden name without actually deleting it
- During master's regular scan of the file system namespace, it removes any hidden files, and its in-memory metadata is erased, so severs the links to its chunks
- During a similar scan of the chunk namespace, the master identifies orphaned chunks and erase metadata for those chunks
- During HeartBeat messages, master and chunkserver exchange chunk information and chunkserver can now delete such chunks
- Advantages of lazy over eager:
- simple and reliable in a large-scale distributed system where component failures are normal: provides a uniform and dependable way to clean up any replicas now known to be useful
- Merges storage reclamation into regular background activities: done in batches, cost is amortized, done when master is free
- delay in deleting provides the option to undo accidental, irreversible deletion
- Disadvantage and mitigation:
- Hinder users from actively manage storage space when it is limited
- provide option to expedite deleting such files
- Hinder users from actively manage storage space when it is limited
- The master tracks chunk version number to distinguish up-to-date and stale replicas
- Whenever the master grants a new lease, it increments version number and notify all replicas
- unavailable replicas become stale now
- Master detects stale replicas when chunkservers report its chunks and version numbers
- Master removes stale replicas during garbage collection, and before that, takes it as non-existing
- Master sends client/chunkserver chunk version number to verify up-to-date chunks
Fast Recovery
- Master and chunkservers restart in seconds
- Does not distinguish normal or abnormal termination
Chunk Replication
- Chunk is replicated on multiple chunkservers across racks
- May use parity or erasure codes for cross-server redundancy
- good for append and read workload (instead of small random writes)
Master Replication
- Master state is replicated for reliability
- Including operation log and checkpoints
- Use checksumming to detect corruption and use replicated chunks for correction
- Each chunkserver needs to independently verify its own copy because it is not practical to compare with other replicas
- Checksum is stored like a log (persistent)
- Checksumming has little effect on read performance, only a small amount of extra data for verification
- no I/O required
- Checksum is optimized for append: incrementally update the checksum for the last partial checksum block
- detect corruption at reads
- Chunkserver verifies inactive chunks during idle periods
- Diagnostic logging: log major events and all RPC requests and replies
- Minimal cost: sequential and asynchronous
- What problem does this paper address?
- How to design a storage system with mostly append workload, large files, and failures are normal?
- What is the author's main insight?
- A single master and multiple chunkservers
- Add atomic record append operation
- Large chunk size
- Use lease for consistency
- What are the paper's key strengths?
- Optimized for given Google workload
- simple structure: highly scalable, fault tolerant
- What are the paper's key weaknesses?
- All metadata in memory still possible nowadays?
- Master single point of failure?
- Bigtable is a distributed storage system that is used to manage structured data
- reliably scale to petabytes of data
- thousands of machine
- Goals:
- wide applicability, scalability, high performance, high availability
- Support a variety of workloads: throughput-oriented batch-processing jobs or latency-sensitive serving of data
- Provides clients with a simple data model that supports dynamic control over data layout and format
- clients can control locality of their data
- Bigtable is a sparse, distributed, persistent multi-dimensional sorted map
- The map is indexed by a row key, a column key, and a timestamp
- Each value is an uninterpreted array of bytes
Row
- Row keys are arbitrary strings
- Every read or write of data under a single row key is atomic (regardless of columns)
- Maintain data in lexicographic order by row key
- Tablet: row range
- row range is dynamically partitioned, each is called a tablet
- Tablet is the unit of load balancing and distribution
- Reads of short row ranges are efficient and require less communications
- Clients can select their row keys to store data with locality together
Column
- Column family: column keys are grouped into sets called column families, which is the unit of access control
- All data stored in the same column family is usually the same type
- Intention is to keep the number of column families small; the number of columns can be large
Timestamp
- Each cell in the Bigtable can contain multiple versions of the same data
- Versions are indexed by timestamp
- Old cell versions can be garbage collected as desired
- API provides functions for creating, modifying, and deleting tables and column families
- Can change cluster, table, and column family metadata
- Operations are on individual row keys
- Use GFS to store log and data files (SSTable files)
- SSTable provides a consistent, immutable, ordered map from keys to values
- Use scheduler to schedule jobs involved in BigTable serving
- Use Lock service for master election, location bootstrapping
- MapReduce: used to read/write BigTable data
- One master server, many tablet servers. Tablet servers can be dynamically added or removed
Master
- Master is responsible for:
- assigning tablets to tablet servers
- detecting the addition and expiration of tablet servers
- balancing tablet server load
- garbage collection of files in GFS
- handle schema changes (e.g. table and column family operations)
Tablet Server
- Tablet server is responsible for:
- manage a set of tablets (row ranges)
- Handle read and write requests to the tablet it is responsible for
- Split tablet if it is too large
Client
- Client talks directly to tablet server for data
- Use chubby to find tablet location
How to get Tablet Location
- Three-level hierarchy to store tablet location information
- Chubby file
- Root tablet
- Metadata tablet
- Steps:
- The client contacts Chubby for the root tablet location
- Contacts root tablet server too get location of all METADATA tablets
- Contact METADATA tablet server to get location of its user tablet
- The client caches METADATA tablet locations to amortize communication cost
Tablet Assignment
-
Persistent state of a tablet is stored in GFS (instead of on tablet server)
- Tablet server is responsible for serving requests
- Modify persistent state in GFS if needed
-
Each tablet is assigned to one tablet server at a time
- master keeps track of the set of live tablet servers and their assigned tablets
-
Master uses Chubby to keep track of tablet servers:
- each tablet server acquires a uniquely-named file in a Chubby directory in order to serve
- When a tablet server stops to server, it releases the lock so the master can reassign its tablets
- When a tablet server loses the lock due to failure, it attempts to reacquire the lock and restart; if the file is deleted, the tablet server can no longer serve and will terminate
-
The existing sets of tablet changes if:
- table created or deleted
- multiple tablets merged into one
- one tablet split into two
- Master initiates 1 & 2; for 3, tablet server notifies master
Master Assignment
- When master starts, it does:
- grab a unique master lock in Chubby
- Scans server directory in Chubby to find active tablet servers
- Ask tablet servers for their assigned tablets
- Master scans the METADATA table to learn the set of tablets. Add to unassigned tablets if found
- Master kills itself if its Chubby session expires
- Persistent state of tablets is stored in GFS
- Updates are committed to a commit log that stores redo records, among updates:
- Recently updates are in memory in a sorted buffer called memtable
- older updates are stored in a sequence of SSTables
- Client write operation:
- server checks write request is well-formed and authorized
- write to commit log
- insert to memtable
- Client read operation:
- server checks well-formed and authorized
- merge SSTables and the memtable (efficient since both are sorted)
- read on the merged view
Tablet Recovery
- Get the list of SSTables that comprise a tablet from METADATA table
- Get a set of redo pointers into any commit logs that may contain data for the tablet
- Reconstruct tablet by reading all SSTables into memtable and then apply all commits
Minor compaction:
- when memtable size reaches a threshold, convert it into an SSTable, write to GFS, initiate a new memtable
- Two goals:
- Shrink memory usage of the tablet server
- during tablet recovery, reduce the amount of data read from commit log
Merging Compaction
- Reads the contents of a few SSTables and the memtable, merge into a new SSTable
- Bound the number of SSTables
Major Compaction
- Rewrites all SSTables into exactly one SSTable
- Some deleted data may still be there by overlapping it with a delete operation
- use major compaction to reclaim resources used by deleted data
- happen periodically in the background
Locality Group
- Group multiple family columns together into a locality group
- In each tablet, a separate SSTable is used to store values in a locality group
- More efficient reads, don't read unnecessary info
Compression
- Client has the option to compress their SSTables (per-block basis) for a locality group
- Client can choose format
- Very good compression ratio since:
- All pages from one host are stored close to each other
- Locality group has similar data
- multiple versions of the same value
Caching for read performance
- Scan cache that caches key-value pairs: temporal locality
- Block cache that cache SSTable blocks: spatial locality
Bloom Filters
- For read:
- ask whether an SSTable might contain any data for a specific row/column pair
Commit Log
- Use one commit log per tablet server (Instead of one commit log per tablet)
- More complicated recovery:
- Sort log by tablet, then each new tablet server needs to do one contiguous read
- What problem does this paper address?
- How to design a general-purpose data-center distributed storage system for structured data that is highly scalable, high-performance, and flexible to many applications?
- What is the author's main insight?
- distributed multi-level mapping
- Use building blocks wisely
- Fully utilize client-specifiable locality
- What are the paper's key strengths?
- General-purpose, highly scalable, high-performance, highly available
- What are the paper's key weaknesses?
- Need clients to be educated to utilize locality features
Map
- Input: a set of input key/value pairs
- Output: a set of intermediate key/value pairs
- MapReduce groups together all intermediate values with same intermediate key, pass them to Reduce
- Map invocations are distributed across multiple machines by partitioning input data into a set of splits; Each split can be processed in parallel on different machines
Reduce
- Input: an intermediate key and a set of values for the key
- Output: merge these values to produce a smaller set of values
- Intermediate values are supplied to Reduce with an iterator
- Distributed by partitioning the intermediate key space into multiple partitions
- MapReduce library splits input file into M splits. Start programs on each machine in the cluster
- One program is the master, all others are workers. Master assigns map and reduce jobs to each worker.
- Map worker reads assigned input split, parse key/value pairs and send to Map function. Output intermediate kv pairs to memory.
- Buffered pairs are partitioned by intermediate keys. Stored in local disk. Send disk location to master. Master sends location to reduce workers.
- Reduce worker learns the locations, use RPC to transfer intermediate data from map worker's local disk.
- Reduce worker sorts intermediate keys, group values with the same key together
- Reduce worker pass each unique intermediate key and its values to Reduce function. Append output to final output file.
- After all map and reduce jobs are completed, the master wakes up the program.
- For each map and reduce job, the master keeps its state and its assigned worker
- The master stores the location of map worker's local disk, for each map job
Worker Failure
- Master periodically ping worker to ensure they are alive
- If some worker is dead,
- reschedule both completed and in-progress map jobs (since result lost)
- reschedule only in-progress reduce jobs
- Reduce workers are notified if any map job is re-executed
Master Failure
- Periodically store master's state in checkpoints
- Restart from checkpoint
Deterministic vs. Non-Deterministic
- If deterministic map and reduce functions, then the output is the same as a non-faulting sequential execution
- If non-deterministic, the second reduce function may produce a different result because its intermediate input may be different (due to failed map tasks)
- GFS stores files on different machines
- MapReduce schedule map jobs considering its input file's location
- exploit locality of input files and map workers
- Choose larger M and R than the number of machines in the cluster
- Run many different tasks improve load balancing and recovery speed
- There are cases where a specific machine is unusually slow
- The master schedules backup executions for in-progress tasks
- What problem does this paper address?
- A distributed system for large-scale, data-intensive, parallel data processing on community clusters, simple abstraction, handle parallel computing, job scheduling, fault tolerance.
- What is the author's main insight?
- A simple map and reduce abstraction
- Can utilize machines on a cluster to do parallel jobs
- A centralized master design
- fault tolerance on workers
- What are the paper's key strengths?
- Easy to use
- Can distribute parallel jobs to machines
- fault tolerance on workers
- What are the paper's key weaknesses?
- Failure handling for master is not discussed
- MapReduce does not support one workload well:
- Reuse a working set of data across multiple parallel operations
- e.g. iterative ML algorithms, interactive data analysis tools
- Cache repeatedly used working dataset to reduce the overhead of fetching from disk everytime
- Developers
- write a driver program that implements the high-level control flow of their application
- launch various operations in parallel
Resilient Distributed Dataset (RDD)
-
A resilient distributed dataset is a read-only collection of objects partitioned across a set of machines, that can be rebuilt (from its parent) if a partition is lost
-
The elements of an RDD does not need to exist in persistent storage
- a handle to an RDD contains enough information to compute it starting from data in reliable storage
-
Datasets will be stored as a chain of objects capturing the lineage of each RDD, each object contains a pointer to its parent and information about how the parent was transformed
-
RDD can always be reconstructed if nodes fail:
- its partitions are re-read from their parent datasets and eventually cached on other nodes
-
To construct a RDD:
- from a file in a shared file system
- Divide an array of files
- Transform an existing RDD with functions
- Change the persistence of an existing RDD between cache and save modes
Parallel Operations
- Reduce: combine dataset elements
- Collect: send all elements to the master
- foreach: iterate through each element with a function
- What problem does this paper address?
- How to develop a large-scale data-intensive parallel distributed system that supports a fast reusing working set?
- What is the author's main insight?
- Use caching to store dataset in memory instead of disk
- Use lineage & parent information to reconstruct lost in-memory data from persistent storage
- What are the paper's key strengths?
- Utilize memory to reduce the cost of reading from disk again and again when reusing a working set
- What are the paper's key weaknesses?
- More practical for batch workloads instead of interactive workloads
- Ease to use: DSM < RPC < MPI
- Complexity added to OS: MPI < RPC < DSM
- Message passing is the basis of most interprocess communication in distributed systems
- MP is at the lowest level of abstraction
- Requires application programmer to be able to identify expected destination process, source process, message, and data types
Syntax
- Simplest primitives:
- send(receiver, message, sender(optional))
- receive(sender, message)
- send requires to know the receiver process and message data
- receive requires to know the expected sender and provide a storage buffer for the message
Blocking vs. Non-blocking
- Blocking is easier for programming and debugging, but CPU is idle when blocked. More often chosen.
- Non-blocking allows concurrent execution when sending, but need interrupt to be informed when message is sent and message buffer is cleared
Buffered vs. Unbuffered Messages
- Unbuffered: send directly to the receiving process. Problem when send() is called before receive() because the address in send does not refer to an existing process.
- Buffered: saved in a buffer until a process is ready to receive them. Messages are queued in a buffer waiting until requested by the receiver.
Reliable vs. Unreliable send
- Unreliable: after sending the message, does not expect acknowledge nor retransmission if message is lost
- Reliable: wait for acknowledgement and will retransmit if message lost.
- Lost message handled either:
- OS retransmission
- OS notifies sender
- Sender detects it itself
Direct vs. Indirect
- Indirect: messages are sent to a port, then receiver receives it from the port
- Direct: message sent direct to the process itself, named explicitly in the send
Fixed vs. Variable Message Size
- A tradeoff between implementation difficulty and programming complexity
Passing data by value/reference
- Mostly pass by value because the processes execute in separate address space
MP Difficulty
- Application programmer needs to control data movement at the granularity of processes, control the synchronization
- MP leaves the application programmer with the burden of explicit control of data movement
- RPC increases the level of abstraction and provides semantics similar to a local procedure call
Syntax
- Syntax of RPC
- call procedure_name(value_arguments; result_arguments)
- receive procedure_name(in value_parameters; out result_parameters)
- reply(caller, result_parameters)
- The client process blocks at the call() until the reply is received
- The remote procedure is the server processes which has already begun execution on a remote machine. Receive() blocks until it receives a message from the sender. The server reply() after finishing the task
Semantics
- Semantics of RPC is the same as a local procedure call: the calling process calls and passes arguments to the procedure, and it blocks while the procedure executes. When the procedure completes, it return results to the calling process.
- For instance,
- The execution of the call() generates a client stub which marshals the arguments into a message and sends the message to the server machine
- On the server machine, the server is blocked waiting for message. After receiving the message, a server stub is generated and extracts the parameters from the message, and pass to the procedure
Binding
- binding is needed to provide a connection between the name used by the caller and the location of the remote procedure
- Implemented by:
- using OS, storing a static or dynamic linker between the procedure name and its location on another machine
- suing procedure variables, which links to the procedure location
Communication Transparency
- The user should be unaware of using a remote procedure call (instead of local)
- Three difficulties:
- Failure detection and correction due to communication and site failures: can result in inconsistent data because of partially completed processes, leave to programmer to deal with
- Parameter passing: only pass value parameters
- Exception handling: use available exceptions
Concurrency
- Since call and receive are blocking, single-threaded machines can cause significant delays
- Use mechanisms to execute calls concurrently
Heterogeneity
- Use a static interface declaration of remote procedures to allow communication on different OS or language
Summary
- RPC abstracts away communication and transmission (compared to MP)
- True transparency is hard and unsolved
- Distributed Shared Memory is memory which, although distributed across machines over a network, gives the appearance of being centralized
- The memory is accessed through virtual addresses, so processes are able to communicate by directly writing and reading data which are directly addressable
- DSM relieves application programmers from concerns of message passing
- Higher complexity for OS as it still needs to send messages between machines to read locally unable memory, and maintain replicated memory consistency
Syntax
- Same as a normal centralized memory multiprocessor:
- read(shared_variable)
- write(data, shared_variable)
- The OS locates the variable through its virtual address
Structure and Granularity of the Shared Memory
- The memory can be in the form of:
- an unstructured linear array of bytes
- structured forms of objects
- Fine or Coarse grained:
- data should be shared at bit, word, or page level
- coarse-grained: page-based distributed memory, where paging takes place over the network instead of to disk. Offer sequential consistency at the cost of performance.
- fine-grained: higher network traffic
Consistency
- If one copy, then a request for a non-local piece of data results in a trap, causes OS to fetch it remotely. If a piece of data is requested by multiple machines, thrashing happens.
- If multiple copies, consistency among replicated data becomes the concern:
- Consistency model determines the condition under which memory updates are propagated
- Cannot just use cache coherence protocol because its strict consistency models cause too much network traffic
- Stricter models result in more network traffic, thus worse performance
- Weaker models result in less network traffic, thus better performance, but makes the programming model more complicated. Weaker consistency is a concern of OS designers.
Synchronization
- Shared data must be protected by synchronization primitives
- Three methods:
- Managed by a synchronization manager
- Responsibility of the programmer
- Responsibility of the system developer (implicit at application level)
Scalability
- Scales better than tightly-coupled shared memory multiprocessors
- Limited by physical bottlenecks (e.g. buses)
Heterogeneity
- Hard to accommodate different machines, languages, or OS at the page level
![image-20210802125128246](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210802125128246.png)
- Fine-grained, heterogeneous computations
- milliseconds to hours
- heterogeneous hardware
- Flexible computation model
- Both stateless and stateful computations
- Dynamic execution
- the order of computation finish is not known in advance
- next executions are determined by the execution
Task
- Stateless and side-effect free
- Operate on immutable objects
- outputs are determined solely by their inputs
Actor
- Stateful execution
- Operate on mutable objects
- Given a handle to access the actor in the future
![image-20210802135703369](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210802135703369.png)
- Dynamic task graph computation model
- execution of both remote functions and actor methods are triggered by the system when their inputs are available
- computation graph is constrcuted from a user program
- Two types of nodes:
- data objects
- remote function invocations (tasks)
- Three types of edges
- data edges: data dependencies
- control edges: computation dependencies
- stateful edges: connect stateful execution on the same actor
Application Layer
- A driver
- Stateless workers: execute stateless tasks by the driver or other workers
- Stateful actor: execute on the method it exposes
Global Control Store
- Maintains the entire control state of the system
- lineage information
- A key-value store with pub-sub functionality
- Sharding to achieve scale and per-shard chain replication
- Store object metadata in the GCS rather than in the scheduler, decoupling task dispatch and task scheduling
- because involving the scheduler in each object transfer is prohibitively expensive
Bottom-Up Distributed Scheduler
-
Two-level hierarchical scheduler
- global scheduler
- local scheduler
-
Each node's local tasks are tried to schedule locally first
-
Only send to global scheduler if:
- local is overloaded
- local resource is insufficient to schedule the task
-
Global scheduler determines which node using:
- estimated queue time
- estimated input transfer time
In-Memory Distributed Object Store
- Store the inputs and outputs of every task
- Implement object store with shared memory
- allow zero-copy data sharing between tasks running on the same node
- Immutable data only
- non-local inputs are replicated to the local object store before execution
- Outputs are written to the local object store
- High throughput due to in-memory operations
- What problem does this paper address?
- A highly-scalable computation framework that handles heterogeneous requirements in a) stateful vs. stateless; b) short latency-sensitive vs. batch computation; c)
- What is the author's main insight?
- A GCS storing lineage information, data object location
- A bottom-up scheduler
- An in-memory distributed object store
- lineage-based fault tolerance with replication-based GCS
- What are the paper's key strengths?
- A general-purpose RL framework that supports training, simulation, etc.
- Satisfy both task parallel and actor stateful execution
- Highly scalable with sharded GCS, distributed in-memory object store, hierarchical scheduling
- What are the paper's key weaknesses?
- Unable to use application-optimized scheduler
- MapReduce and parallel database systems can serve the same set of computation requirement through different approaches
- What are the differences between MR and parallel database approaches?
MapReduce
- Use a central scheduler to schedule Map and Reduce tasks to nodes
- Run M Map tasks and R Reduce tasks
- A Map task takes a partition (a set of key/value pairs) of input file and execute a user-defined Map function, produces a set of intermediate key/value pairs, and split them into R disjoint buckets, stored in local disk storage
- Each Reduce task takes its part of bucket from all M Map tasks, apply a user-defined Reduce function that combines intermediate outputs based on key, output a part of final output, stored in global storage
Parallel DBMSs
- Standard relational tables (transparent physical location)
- Data are partitioned over cluster nodes
- SQL
- Join processing: T1 joins T2
- if T2 is large, then hash partition T1 and T2
- send partitions to different machines
- (similar to split-copy in MapReduce)
- Query Optimization
- Intermediate tables not materialized by default
![image-20210803142937626](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210803142937626.png)
Schema Support
- MapReduce:
- More flexible, good for one application, bad for multiple applications sharing one data set
- Parallel DBMS
- Relational schema required, good for multiple applications sharing
Programming Model & Flexibility
- MapReduce:
- low level
- very flexible
- Parallel DBMS:
- SQL
- user-defined functions, stored procedures, user-defined aggregates
Indexing
- MapReduce:
- No native index support, can implement their own but hard to share
- Parallel DBMS:
- Hash/b-tree indexing well supported
Execution Strategy & Fault Tolerance
- MapReduce:
- Intermediate results are saved to local files
- If a node fails, run the node-task again on another node
- Large number of disk seeks at mapper machine during intermediate result transfer
- Parallel DBMS:
- Intermediate results are pushed across network
- If a node fails, the entire query needs to run again
Avoiding Data Transfer
- MapReduce:
- locate Map close to data
- Parallel DBMS:
- lots of optimization
- e.g. where to perform filtering
Node Scalability
- MapReduce:
- 10000's of commodify nodes
- 10's of Petabytes of data
- Parallel DBMS:
- <100 expensive nodes
- Petabytes of data
Setup and Configuration
- MapReduce:
- easy
- DBMS:
- hard to configure
Ease of Use
- More familiar with language used by MapReduce compared to SQL
Performance
- DBMS is faster than MR because:
- Indexing: B-tree indices to speed the execution of selection operations
- Column store: column oriented storage
- Compression: aggressive compression and operate on compressed data
- Parallel algorithm for querying large amount of data
- What problem does this paper address?
- What's the differences between approaches of MapReduce and Parallel DBMS for large-scale data-intensive computation?
- What is the author's main insight?
- For the size of 100 nodes, Parallel DBMS has a better performance due to indexing, storage schema, compression, parallel algorithm.
- MapReduce is easier to use, may perform better in a larger cluster with commodity hardware
- What are the paper's key strengths?
- What are the paper's key weaknesses?
- First come first serve
- Round robin
- Priority system
- cons: highest priority always wins, starvation risk. Priority inversion: high-priority jobs can be blocked behind low-priority jobs
- Multi-level feedback scheduling
- cons: fairness only over very long term
- Proportional-share scheduling algorithm
- Give each job its proportional number of tickets
- On each time slice, pick a ticket and give owner the resource
- On average, resource fraction (CPU time) is proportional to number of tickets given to each job
To Assign Tickets
- Priority determined by the number of tickets each process has
- To avoid starvation, each job gets at least one ticket
- Probabilistically fair
- Advantage over strict priority scheduling:
- behaves gracefully as load changes: adding or deleting jobs affect all jobs proportionally
- Mostly fair, but short-term unfairness is possible (see stride scheduling)
- If a job depends on another job, it gives all its tickets to the other job
- solves priority inversion problem
- Compensate for resources assigned but not used (yield)
- Ticket inflate proportional to the amount of time you did not use, until next time you win
- What problem does this paper address?
- A scheduling scheme that provides fair share, avoids priority inversion, simple and easy-to-use?
- What is the author's main insight?
- Lottery-based scheduling, each job gets a proportional number of tickets
- Hand over tickets to another job if you depend on it to finish
- What are the paper's key strengths?
- Simple
- Solves priority inversion
- Fairness in the big picture
- What are the paper's key weaknesses?
- ticket inflation
- short-term unfairness
- Sort resource requirement in an increasing order
- Give the lowest demand first, with min(Capacity/requesters, its demand)
- Give the extra resource to other resources if any left
-
Any person is worse if they lie about their demand
-
Weighted Max-Min fairness: takes weights into consideration
- Higher priority gives higher resource
- Share guarantee of at least C/n
- Strategy-proof
- Isolation: Users cannot affect each other
- Dominant resource: the resource that a user wants the highest proportion (e.g. 1/2 is greater than 4/100)
- Different users may have different dominant resources
- DRF applies max-min fairness across users' dominant resources
- DRF maximizes the smallest dominant share in the system, then the second-smallest, and so on
Steps
- DRF picks the user with the lowest dominant share(max{utilization rate}), run one of its tasks
- Update the dominant share, consumed resources
- Repeat
- e.g.
![image-20210803174509311](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210803174509311.png)
Characteristics
- O(log n) time per decision, n is the number of users
- Need to determine demand vector for each task
Weighted DRF
- Modify dominant share calculation to max{utilization rate / weight}
- Strategy-proof: no better to lie about its demand
- Share guarantee: each user is guaranteed to have at least C/n share
- Envy-free: no task wants to swap with another task's resources
- Pareto Efficiency: it is not possible to give extra resource to any user without hurting at least one other user
- Single-resource fairness: if only one resource, then max-min
- Bottleneck fairness: if all tasks have one dominant resource, than it is shared according to max/min fairness
- What problem does this paper address?
- How to schedule heterogeneous resources to users with heterogeneous demand in a fair way?
- What is the author's main insight?
- Determine a user's dominant resource and schedule tasks based on it
- What are the paper's key strengths?
- Strategy-proof, share guarantee
- What are the paper's key weaknesses?
- If the number of users is too many, long scheduling time
- A deterministic version of lottery scheduling
- Every thread has a "stride" that is inversely proportional to its priority
Steps:
- Initially, everyone's "pass" is equal to its "stride"
- Every time slice, choose the thread with the lowest pass
- After a thread is run, increment its "pass" by "stride"
- repeat step 2
Advantages over Lottery scheduling:
- deterministic variance from the ideal ratio
- short-term stability
Monolithic Kernel
- Running entirely in the kernel mode
Microkernel
-
Modular design that separates OS into two parts:
- Control basic hardware resources (i.e. microkernel), responsible for low-level process management, handle message passing, interrupt handling, virtual memory
- Control unique characteristics of the environment for applications, such as file system, memory management.
-
Advantages:
- Portability: microkernel as a narrow waist that provides hardware independence for other parts of OS and application
- Extensibility & customization: environments can evolve independent of microkernel
- Functionality & performance for kernel: simpler functionality so easier to parallelize, distribute, and secure
- Flexibility: system environment can run remotely
- Real-time: kernel does not need to hold long locks & OS environment is preemptable, can provide real-time support
- Fault Isolation
- Security
- More reliable (less code in kernel)
Process and thread management
- Resources are allocated to processes, which consists of an address space and communication access to system
- Thread performs computation and consumes its process' resources
- Threads are scheduled by the Mach kernel
- User space programs can control scheduling policy
Interprocess Communication
- Interprocess communication is implemented exclusively through ports (similar to a handle)
- Processes, threads, memory objects, and Mach process services are all manipulated by sending messages to ports that represent those objects
- The port abstraction can be accessed through network
Memory Object Management
- Memory objects can represent anything that makes sense to access through memory references (e.g. I/O devices, disk I/O)
- Processes use ports to interact with memory objects
- Process address space is a collection of mappings from linear addresses to offsets within memory objects
System Call Redirection
- Mach kernel allow redirecting certain system calls or traps back to the calling process to handle
- Require one kernel entry and one kernel exit
- Emulation library is inherited by child processes
- Exceptions can also be redirected
Device Support
- Each device is represented as a port, can talk to it from threads
User Multiprocessing Support
- Allow user-level OS/applications to use multithreading without involving kernel
Multicomputer Support
- All communication happens through IPC (via port), which can be done through networks
- Key idea: treat OSes like application programs
- Three approaches:
- Native OS systems
- Large granularity server systems
- Small granularity server systems
Native OS systems
- key idea: Put everything that a kernel would handle into an emulation library, where a system call will be redirected to
- Allows most of original system calls to execute without change
- Emulation library responsible for system calls and virtualizing access to hardware devices
- Similar performance to other UNIX implementations
Large Granularity Server Systems
- Key idea: concentrate most non-native OS's functionality in a single server
- Emulation library redirects again to the server, which handles system calls and memory management
- Requires porting but most code can be reused
Small Granularity Server Systems
-
Key idea: One server (module) per non-native OS's functionality
-
Each service is implemented in a separate module
-
Key advantage:
- decomposition means isolation
-
One server can service multiple client OSes
Performance
- Native OS, large granularity, and UNIX have similar performance
- The amount of IPC is too high for small granularity system, so lower performance
- What problem does this paper address?
- How to implement OS using a modular design to improve portability, evolvability, security, etc.?
- What is the author's main insight?
- Separate OS into two parts: only the necessary hardware management part runs in the kernel, other OS functionality in the user-level
- Kernel manages CPU, IPC, virtual memory, device I/O
- Use system call redirection to allow user-level processes to execute system calls
- What are the paper's key strengths?
- Microkernel allows better portability, evolvability, security, flexibility
- What are the paper's key weaknesses?
- Performance metrics are missing
- microkernel significantly reduces the amount of kernel code, making a formal verification of correctness possible.
- The assumptions are hardware, compiler, assembly code, cache management, boot sequence.
- It shows that formally verifying a system is possible
- It is very time consuming (years of time) to formally verify even microkernel
- No clear mentions of what correctness guarantees are achieved, only mentioned ~80 invariants
- Extend the kernel (called spindle) at runtime through statically checked extensions
- Safety ensured using programming language
- Event/handler abstraction
- spindle listen to and handle events (e.g. spindle can listen and react to page faults and error)
Spindle Capabilities
- Run code in the kernel in general
- Listen to and handle events
- Fine-grain hardware resource manipulation
- Define new syscalls
- What problem does this paper address?
- How to modify OS to provide extended functionality in a minor way?
- What is the author's main insight?
- Dynamically load extensions to the kernel, which is statically checked
- Partition code across the user/kernel boundary to avoid extra IPC and switching
- Use special programming language to ensure safety
- Use an event/handler abstraction for the spindle
- What are the paper's key strengths?
- Minor modification to OSes (except PL requirement)
- extend functionality of existing OSes, while maintaining safety?
- What are the paper's key weaknesses?
- require special PL
- Safety cannot be secured by extensions
- Extensions should run in user space instead
- Abstraction is generalization that is often an API in CS, hides implementation details
- Advantages of abstractions:
- Simpler. Easy to understand and use
- Standardization. Many implementations all satisfy the abstraction
- Disadvantages of abstractions:
- It is a compromise, the least common denominator. not perfect for each use case
- Low performance (compared to tweaking the software for each use case)
- Possible bloated software
- Low extensibility:
- abstractions are overly general
- apps cannot control resource management
- Implementations are fixed
- Low performance:
- context switching is expensive
- Generalization and hiding information affect performance
- Separate resource management and protection
- kernel: resource sharing, protect hardware, protect LibOSes from each other
- untrusted LibOS: implement traditional OS abstractions, manage resources
- Advantages:
- Efficient because LibOS is in user space
- Apps can link to their choice of LibOS for better performance
- Minimalist kernel design
-
Securely expose hardware
- Exokernel should only manage resources to the extent required by protection (i.e. allocation management, revocation, and ownership)
-
Expose allocation
- LibOS should be able to request specific physical resources
- LibOS should participate in every allocation decision
-
Expose names
- export physical names such as physical page number
- export bookkeeping data structures such as TLB entries
- avoids layer of indirection, better performance
-
Expose revocation
- Apps should be notified when their resource is being taken away
- Apps can determine which resource to relinquish
-
Exokernel still arbitrate resource contention with its own policy
Secure binding
- Goal: separate authorization and actual use of the resource
- Only perform expensive authorization at bind time
- Future accesses only need simple operations for protection checks
- e.g. TLB fault -> LibOS maps virtual address to physical address and load into kernel (expensive authorization here) -> application accesses only need simple access checks
Visible Resource Revocation
- Revocation is like a dialogue where Exokernel revokes some page and LibOS saves the page to disk and frees it
- Higher latency
Abort Protocol
- Forcefully break LibOS's all secure bindings if it fails to respond
- Inform the LibOS afterwards
- What problem does this paper address?
- How to design an OS that provides minimal kernel design and allow maximum freedom to applications for a better performance?
- What is the author's main insight?
- Separate protection and resource management
- Kernel does protection only, and provides maximum freedom to untrusted LibOS which does resource management instead
- What are the paper's key strengths?
- Modularity
- applications can choose LibOS that satisfies its choice the most
- What are the paper's key weaknesses?
- Cost of development and portability is high
- OS today have protection
- file system with RBAC
- process protection
- memory protection
- Problem now?
- Ignoring information flow
- Process P can read a secret file and write it to public space
Main Idea
-
Files and processes colored
- Label private stuff RED
- Label public stuff GREEN
-
Enforce communications allowed/disallowed
Six Kernel Objects
- Segment (data), array of bytes
- Threads
- Address space
- Device (Network)
- Gate (IPC)
- Container (directory)
Information Flow
- Every kernel object has a label
- label describes the security property of the object
- HiStar will only allow kernel objects to interact (information to flow) if two kernel objects have consistent labels
Unix vs. IFC
- IFC tracks information flow so that any information flow out of high level gets tainted as high
- If some process doesn't have high level permission, it cannot access the information no matter how malicious it is
- In Unix, a process may mistakenly leak information to a public file, screen, or network
Kernel
- HiStar offers a minimalist kernel that exposes information flow
- Using labels on kernel objects, applications can implement security
- What problem does this paper address?
- How to design a kernel that implements and enforces mandatory access checks and allow applications to build on it?
- What is the author's main insight?
- Use information control flow, where each kernel object is labeled
- More tainted object (e.g. password) cannot be accessed by lower tainted object (e.g. network)
- What are the paper's key strengths?
- Provide secure kernel where less privileged objects cannot access more privileged (tainted)
- What are the paper's key weaknesses?
- Performance
- Automatically checkable
- specify checkable invariants
- automatically generate code from proof
- Verify file system in layers starting from a log at the bottom and then each layer above
- Crash Hoare Logic extends Hoare Logic (precondition, postcondition) with recovery procedures to ensure integrity under fail-stop conditions
- contains the states a system can enter in a crash
- Crash conditions are used to prove the log (rolled back)
- The proof is drastically simplified by proving the core logging mechanism, upon which safe transactions can be assumed
- Lots of things
- Complex features hard to specify invariants
- Non-atomic operations combine into an atomic state transition
- crash behavior: what happens when system crashes?
- Curious database administrator (DBA)
- an adversary that gains complete control of the database and application servers
- Can prevent the confidentiality of logged-out users' data
- How to minimize the amount of confidential information revealed to the DBMS server while efficiently executing a variety of queries?
- Minimize the amount of data leaked when an adversary compromises the application server in addition to the database server
- Enable running most standard queries on encrypted data
- Protects data of users logged out during attack, even when all database and application servers are compromised
- Modest overhead and no changes to DBMS
- The application issues a plaintext query, with sensitive information
- The proxy intercepts and rewrites: anonymizes table and column name, using master key to encrypt constant in the query
- Proxy checks if DB server should be given keys to adjust encryption layers before executing the query
- send encryption key if needed
- The DBMS does query as usual, which the proxy decrypt and return result to the application
Random (RND)
- provides maximum security in CryptDB
- indistinguishability under an adaptive chosen-plaintext attack
- Not efficient for any computation
Deterministic (DET)
- provides weaker guarantee than random, but still strong security
- leaks only which encrypted values correspond to the same data value, by deterministically generating the same ciphertext for the same plaintext
- Allows equality checks
Order-preserving encryption (OPE)
- provides weaker guarantee than DET,
- leaks order relations between data items without revealing the data itself
- allow query that involves ordering
Homomorphic encryption (HOM)
- Allow the server to perform computations on encrypted data
- final result can be decrypted at the proxy
- allow query that involves computation on data
- as secure as RND
Join (JOIN)
- An encryption scheme used to join two columns
- DET is not sufficient because CryptDB uses different keys for different columns
- Allow join between columns
Word search (SERACH)
- Used to perform searches on encrypted text
- nearly as secure as RND
- Allow LIKE operations
- Goal: use the most secure encryption schemes that enable running the requested queries
Onions of Encryption
- Encrypted in layers (like an onion)
- the outer most is the most secure
- inner ones are less secure, but provide more functionality
- CryptDB dynamically adjusts the layer of encryption on the DBMS server for specific queries that a user cares about
- Data items can be decrypted only through chain of keys rooted in password of the user
- Data only available when user is logged in
- Users logged out during an attack will be safe (if the attacker doesn't know the password of the user)
- What problem does this paper address?
- How to enable queries on encrypted data while minimizing the amount of sensitive data leaked to the database server?
- What is the author's main insight?
- Use onions of layered encryption with a the outmost layer being the most secure
- Use different encryption schemes to enable different queries (e.g. equality, ordering, etc.)
- Use different keys for different users, so security isolation
- What are the paper's key strengths?
- prevent curious DBA and adversary attacks on non-logged in users
- What are the paper's key weaknesses?
- not effective against passive persistent attacks
- Computation may leak information as different data may result in different access pattern
- Oblivious computation: construct computation such that data access pattern is independent of data values
- Enclave is a private region of memory: can still leak information (e.g. computation pattern)!
- Adversary can control the cloud provider's software stack
- observe and modify network traffic
- root access to OS: observe + modify data and accesses to unprotected memory
- Rollback attack: restore sealed enclave data to a previous state
- Encryption mode: data encryption + authentication + computation integrity
- Oblivious mode: encryption mode + protects data content during computation
- Use oblivious algorithms to prevent access pattern information leakage
- Enclave is not enough, side-channel attacks are possible
- Exploit side effects of out of order attacks in order to get private data
- Attacker uses micro-architecture
- e.g. branch predictor or branch target buffer for saving secret
- e.g. cache for recalling secrete
- Victim loads secret under mis-speculation
- Load should NOT trap
- Still inappropriate if managed language or sandbox
- Victim saves secret in micro-arch state, e.g. cache
- Attacker recalls secret from micro-arch state
- Many existing designs are vulnerable
- Can let Javascript steal from Chrome
- Software: add hardware support to disable branch prediction when important
- Performance cost
Force vs. No Force
- Related to durability
- Force: write to disk when commit. Slow
- No Force: NOT writing to disk when commit. A committed transaction may be lost before written to disk. Need REDO.
Steal vs. No Steal
- Related to atomicity
- Steal: uncommitted data CAN be written to disk. An uncommitted transaction needs to be UNDO.
- No Steal: uncommitted data cannot be written to disk. Slow
- Log: an ordered list of REDO/UNDO actions
- Logging: Record REDO and UNDO information for every update
- sequential writes to log (e.g. a separate disk)
- minimal info to log, so multiple updates in a single log page
- A log record:
- <XID, pageID, offset, length, old data, new data>
- For abstract types, have operations(args) instead of old/new data
- Atomicity:
- Must force log record of an update to disk before corresponding data page is written to disk
- Durability:
- Must force all log records for a transaction to disk before it commits
- Log every update (even UNDOs in transaction abort)
- Upon restart,
- REDO all history without backtracking (all history includes aborted transactions)
- UNDO aborted transactions
Log Sequence Number
- Each log record has a unique log sequence number (LSN)
- Each data page has a pageLSN,
- The LSN of the most recent log record for an update to that page
- System keeps a flushedLSN
- The max LSN flushed so far
Other Log-related state
- Transaction table:
- one entry per active transaction
- contains XID, status, and LastLSN
- Dirty page table:
- One entry per dirty page in the buffer pool
- Contains the LSN that first dirties the page
- reads and writes followed by commit or abort
- assume page write is atomic on disk
- Strick 2PL for writes for concurrency control
- STEAL, No-Force buffer management, with Write-Ahead Logging
-
Checkpoint metadata rather than data
- periodically checkpoint to minimize the recovery time after a crash
-
begin_checkpoint record: indicates when checkpoint begins
-
end_checkpoint record: contains current transaction table and dirty page table
-
During checkpointing:
- other transactions continue to run, so state represent a mix of state after begin_checkpoint
- No attempt to force dirty page to disk, so effectiveness of checkpointing is limited by the oldest unwritten change to dirty page
- We want to undo the updates by the aborted transaction
- Start by finding its last log using LastLSN in the transaction table
- using prevLSN of the the log to find its previous log records
- before starting UNDO, write an abort log record
- Before restoring old value, write a Compensation Log Record (CLR)
- continue logging while undo
- At end of all UNDOs, write an "end" log record
- Write commit to log
- Make sure the transaction's all logs are flushed to disk
- Need to guarantee LastLSN <= flushedLSN
- Log flushes are sequential, synchronous writes to disk
- Many log records per page
- Make transaction visible
- drop all locks
- Write "end" log record in the end
Big Picture
- Start from a checkpoint
- Three phases:
- Figure out which transactions committed since checkpoint, which failed
- REDO all actions
- UNDO effects of failed transactions
1. Analysis Phase
- Reconstruct state at checkpoint
- via end_checkpoint record
- Scan log forward from begin_checkpoint
- End record: remove Xact from Xact table
- Update record: if P not in dirty page table, add it
- Other record: add Xact to Xact table, change Xact status on commit
2. REDO Phase
- Repeat history to reconstruct state at crash
- Reapply all updates (even of aborted Xacts), redo CLRs
- Scan forward from log record containing smallest recLSN in D.P.T.
- For each REDO:
- Reapply logged action
- Set pageLSN to LSN
- at the end of REDO pass, the database has a state that reflects exactly everything on a stable log
3. UNDO Phase
- Backward processing log records, undo each record and generate CLR
- avoid undo what was already undone using CLR's undoNextLSN
- How to limit the amount of work in REDO?
- periodically flush dirty pages in the background
- How to limit the amount of work in UNDO?
- Avoid long-running Xacts
- Actual data (or data diff)
- operation(args)
- describes changes to a specific page, logically within that page
Thread-based concurrency
- There is a thread pool, and a scheduler assigns tasks to threads
- Use concurrency mechanism for critical region
- Advantages:
- easy to implement
- Disadvantages:
- As the number of threads increases, contention for resources and context switches cause high overhead
Event-based concurrency
- There is only one thread (per CPU) running, callbacks are used to run tasks asynchronously
- Subsequent tasks are scheduled after the previous task finishes
- no concurrency control needed
- Advantages:
- efficient and highly scalable
- Disadvantage:
- Hard to engineer and tune
- scheduling is challenging
-
Combine thread-based and event-based concurrency
- achieve both ease for programming and extensive concurrency
-
Key Idea: decompose service into stages separated by queues
- Each stage performs a subset of request processing
![image-20210811172732377](/Users/hanminglu/Library/Application Support/typora-user-images/image-20210811172732377.png)
-
Stage:
- event queues can use various control policies
- Modularity: each stage implemented and managed independently
- Facilitates spotting bottlenecks and debugging
-
Controller:
- thread pool controller: ideal degree of thread-based concurrency at each stage
- easy for programming
- batching controller: controls the number of events in a batch for the stage
- observe queue performance of subsequent stages and adjust batching factor
- thread pool controller: ideal degree of thread-based concurrency at each stage
- What problem does this paper address?
- How to achieve both extensive scalability and ease of programming?
- What is the author's main insight?
- Decompose service into stages, separated by queues
- Combine thread-based and event-based concurrency
- What are the paper's key strengths?
- highly scalable
- easier to program than event-based concurrency
- modularity
- adaption to load variations
- What are the paper's key weaknesses?
- Increased latency for traversing multiple stages and experience multiple context switches
- and queuing
- average performance compared to event-driven?
- programming still harder than thread-based
- Increased latency for traversing multiple stages and experience multiple context switches
- Lock wait time: The tasks may lock shared data structures, so increasing the number of cores increase the lock wait time
- Cache coherence time: Tasks may write a shared memory location, so increasing the number of cores increases the time spent waiting for cache coherence protocol
- Higher cache miss rate: tasks fight for shared cache
- Other resource contention
- Make all OS-level inter-core communication explicit
- instead of shared memory, use message passing
- No memory is shared between the code running on each core, except used for messaging channels
- Make OS structure hardware dependent
- Replicate state across cores instead of sharing state
- instead of sharing memory, each core should have its own replica of state
- Consistency between these states are maintained via exchanging messages
- When a remote procedure call is invoked:
- calling environment is suspended, parameters are passed across the network
- desired procedure is executed remotely
- when the procedure finishes and produces its results, the result is sent back to the calling environment, where execution resumes
- the user makes a "perfectly" normal local call
- It invokes a corresponding procedure in the user-stub
- the user-stub places a specification of the target procedure and arguments into packets
- RPC Runtime transmits reliably to the callee machine
- the server's RPC Runtime receives the packets, send to server-stub
- server-stub unpacks the procedure and arguments, does a "perfectly normal" local call, which invokes the procedure on the server machine
- after server procedure finishes, use the same mechanism to send back results
- calling procedure is suspended before receiving the results
Questions:
- Naming: How does a client of the binding mechanism specify what he wants to be bound to?
- Location: How does a caller determine the machine address of the callee? How to specify which remote procedure to be invoke?
Naming & Location
- Binding operation binds an importer of an interface to a (remote) exporter of an instance
- type: what to implement
- instance: which specific machine that implements it
- Use a distributed database to store the exported interfaces (naming) and their (locations).
- Server publishes interface to the database, and clients contacts the database for binding interface
- Latency is very important
- No TCP
- No connection setup or tear down
- Guarantee that if the call returns, then exactly sent once
- if abort, either one or zero time
- If server RPCRuntime responds, no upper limit on wait time
- e.g. deadlocks or loops
- If communication breakdown or server crashes, abort
- What problem does this paper address?
- Implementation of a practical RPC that has reasonably low latency
- What is the author's main insight?
- user and server run normal local call, use user-stub and server-stub to implement RPC
- No shared memory
- for Transport protocol, use reliable protocol with no setup and tear down to minimize latency
- Flexible binding
- exception handling
- What are the paper's key strengths?
- Low latency
- Higher abstraction, easier to use for programmers
- What are the paper's key weaknesses?
- more overhead (marshalling/unmarshalling) compared to message passing
- Most RPC in distributed OS is 1) local between different protection domains, and 2) small
- Local RPC can be optimized using 4 techniques:
- Simple control transfer: allows client's thread to continue running in server's domain
- Simple data transfer: Shared buffers are pre-allocated (at the time of interface binding), and the caller needs to copy data to the stack, and callee can just use the stack
- Simple stub: use stub generator that produces simple stubs in special language
- Design for concurrency: cache domains on multiprocessors, idle cores can be used to cache top protection domains
- Client calls procedure => kernel trap
- caller writes arguments to shared argument stack
- Kernel validates caller
- Binds to server
- OS dispatches client code in server domain (client runs in server's address space with a shared argument stack)
- A message contains a pointer to code which is a pre-registered handler for the message, which then combines the message content to start computation as soon as possible
- pros:
- Async communication
- No buffering except in networking
- Improved performance
- Handlers are kept simple: an interface between network and computation
-
blocking send/receive:
- requires setup overhead, blocking
- low utilization of network bandwidth, high latency
-
unblocking send/receive with buffering:
- instantaneous return to user
- buffer messages at send and receive, higher latency overhead
-
RPC:
- Active message handler does not perform computation on the network data, but only extracts data from the network and adds it to the existing computation
- Use client caching to improve availability during disconnected operation
- Hoarding (connected state), clients hoard manually specified and recent files locally on their disk
- Whole file caching: simplifies consistency implementation
- Optimistic consistency model with callback-based consistency where server tracks and notifies client caches that they are dritied
- Serverside replication: optimistic, first-class replica, read-write replicas on several servers
- Clientside replication: optimistic, second-class replica, clients hoard files locally and prepare for disconnected operation
- When connected, callbacks ensure consistency: clients are notified when their copy is dirtied
- Disconnected operations will operate on local cached copy, and propagated to servers while reconnection
- Tradeoff between availability and consistency: when system is unavailable, enter reduced consistency operation
- Scalability through
- whole-file caching: open cache misses on open, not on others (e.g. read, write, seek, close)
- callback cache coherence: consistency among server and client replicas
- Put more functionality on clients
Pessimistic
- pros: always consistent.
- cons: when disconnected, becomes completely unavailable. a client could lock up entire directory and become disconnected.
Optimistic
- pro: always readable/writable. High availability, high throughput
- cons: conflict resolution is costly, or even violates transparency
Leases
- pro: avoids the issue that a disconnected client can lock up entire directory
- con: when lease expires, disconnected client still cannot operate. updated files could be lost if disconnected.
- It is important to choose which pages are stored in memory:
- If too many pages of a process are stored in the memory, then fewer other processes can be ready at any one time
- if too few pages of a process are stored in the memory, then page faults happen too frequently, thus the number of active processes currently executing (i.e. not waiting for page fault) approaches zero
- Working set model: the process is in RAM iff all the pages that it is currently using is in RAM
- working set W(t, ω): the set of pages accessed between the time interval (t-ω, t), working set strategies try to keep this in RAM
- all or nothing: if some is not in RAM, then the process is paged out to disk, freeing memory for other processes to use
- Thrashing: if too many processes are allowed to run in a time slice, then the number of pages referenced more than there is RAM, causing thrashing
- If too many processes, swap some to disk so that all processes left have a complete working set.
- All processes (even swapped out ones) will complete faster than running all at once
- Working set strategy pros:
- prevent thrashing - optimize CPU utilization & throughput
- optimize the number of parallel processes - optimize throughput
- Keeping a list of past ω page accesses has overhead
- Use a timestamp of t-ω, all pages accessed in t-ω is in working set
- Limit a fault heavy process' impact on other processes - use process-local page replacement
- Program slowly fault into RAM - use spatial locality-based batched paging
- Increase disk throughput - batched page fault and dirty page write
- Processor time searching page list - use TLB to cache virtual-physical mapping
- Address split in half:
- per-process address space: divided into program and stack
- kernel address space - shared among all processes
- Each user address lookup takes two lookups: 1) find user space page table in system page table (in kernel address space); 2) find user space address
- Leave the first page empty to catch uninitialized pointers
- Each process has a limited resident set of pages
- all page replacement happen in the per-process resident set
- FIFO is used since CPU is more valuable than memory
- When pages are removed from resident set:
- if unmodified, add to the end of free list
- pro: when allocate new, avoid searching through the page table
- if modified, add to the end of modified list
- pros: 1) written to disk in batch, better disk throughput; 2) another layer of caching if it is paged before written to disk
- if unmodified, add to the end of free list
- A page fault reads a cluster of extra pages into memory:
- low marginal cost to bring a cluster of pages
- better spatial locality
- Programs are swapped to disk and back both as a whole unit
- avoid that a missing resident set reduces process' performance during execution
- All information is kept in DRAM at all times
- RAMCloud must scale automatically to support thousands of storages servers
- RAMCloud must be as durable as if it were stored on disk
- It is possible to reduce RPC latency of 300-500 μs to 5-10 μs
- Need switches to achieve less than 1μs routing time
- Need a core dedicated to polling from network interface, so eliminate interrupt and context switch
- Need to circumvent OS or VMM+OS to have packets directly mapped to application address space
-
Buffered logging:
- One primary server, two backup servers
- When a write arrives at the primary, the primary forwards a log entry of the write to backup servers
- Backup servers reply with ack after the log entry is in its RAM
- The primary can now return write
- Backup servers periodically save batched logs to disk to utilize disk bandwidth
- Checkpointing periodically as in a normal log
- The number of concurrent transactions will reduce due to lower processing time
- With less concurrent transactions, can use optimistic concurrency control
- When the scheduler is evaluating which node to execute the head-of-line job, it evaluates if current available nodes have the input data required to execute the job
- If yes, schedule on that node
- If no, delay scheduling this node and evaluate other jobs
- If the job is skipped enough times, scheduler will relax the requirement to rack-level input data, eventually becomes global-level input data
- The tradeoff here is fairness and data locality
- This improves overall performance when the workload contains a lot of small jobs (so data locality nodes are available soon)
- What problem does this paper address?
- What is the author's main insight?
- What are the paper's key strengths?
- What are the paper's key weaknesses?