
\documentclass[10pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{longtable}
\usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref}
\usepackage[title]{appendix}

\newcommand{\mailtourl}[1]{\href{mailto:#1}{\tt #1}}
\newcommand{\tagvalue}[1]{{\tt #1}}
\newcommand{\tagregex}[1]{{\tt #1}}
\newcommand{\metavar}[1]{{\rm\emph{#1}}}

% Use as, e.g., \cigarops{MID} to produce M/I/D with the operators in \tt
\newcommand*{\cigarops}[1]{\cigaropsAux#1*}
\def\cigaropsAux#1#2*{{\tt #1}\if\relax\detokenize{#2}\relax\else/\cigaropsAux#2*\fi}

\begin{document}

\input{SAMtags.ver}
\title{Sequence Alignment/Map Optional Fields Specification}
\author{The SAM/BAM Format Specification Working Group}
\date{\headdate}
\maketitle
\begin{quote}\small
The master version of this document can be found at
\url{https://github.com/samtools/hts-specs}.\\
This printing is version~\commitdesc\ from that repository,
last modified on the date shown above.
\end{quote}
\vspace*{1em}

\noindent
This document is a companion to the {\sl Sequence Alignment/Map Format
Specification} that defines the SAM and~BAM formats, and to the {\sl CRAM
Format Specification} that defines the CRAM format.\footnote{See
\href{http://samtools.github.io/hts-specs/SAMv1.pdf}{\tt SAMv1.pdf} and
\href{http://samtools.github.io/hts-specs/CRAMv3.pdf}{\tt CRAMv3.pdf}
at \url{https://github.com/samtools/hts-specs}.}
Alignment records in each of these formats may contain a number of optional
fields, each labelled with a {\it tag\/} identifying that field's data.
This document describes each of the predefined standard tags, and discusses
conventions around creating new tags.

\section{Standard tags}

Predefined standard tags are listed in the following table and described
in greater detail in later subsections.
Optional fields are usually displayed as {\tt TAG:TYPE:VALUE}; the {\it type\/}
may be one of
{\tt A} (character),
{\tt B} (general array),
{\tt f} (real number),
{\tt H} (hexadecimal array),
{\tt i} (integer),
or
{\tt Z} (string).

\begin{center}\small
% This table is sorted alphabetically
\begin{longtable}{ccp{12.5cm}}
  \hline
  {\bf Tag} & {\bf Type} & {\bf Description} \\
  \hline
  \endhead
  {\tt AM} & i & The smallest template-independent mapping quality in the template \\
  {\tt AS} & i & Alignment score generated by aligner \\
  {\tt BC} & Z & Barcode sequence identifying the sample \\
  {\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
  {\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
  {\tt CB} & Z & Cell identifier \\
  {\tt CC} & Z & Reference name of the next hit \\
  {\tt CG} & B,I & BAM only: {\sf CIGAR} in BAM's binary encoding if (and only if) it consists of $>$65535 operators \\
  {\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\
  {\tt CO} & Z & Free-text comments \\
  {\tt CP} & i & Leftmost coordinate of the next hit \\
  {\tt CQ} & Z & Color read base qualities \\
  {\tt CR} & Z & Cellular barcode sequence bases (uncorrected) \\
  {\tt CS} & Z & Color read sequence \\
  {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features \\
  {\tt CY} & Z & Phred quality of the cellular barcode sequence in the {\tt CR} tag \\
  {\tt E2} & Z & The 2nd most likely base calls \\
  {\tt FI} & i & The index of segment in the template \\
  {\tt FS} & Z & Segment suffix \\
  {\tt FZ} & B,S & Flow signal intensities \\
  {\tt GC} & ? & Reserved for backwards compatibility reasons \\
  {\tt GQ} & ? & Reserved for backwards compatibility reasons \\
  {\tt GS} & ? & Reserved for backwards compatibility reasons \\
  {\tt H0} & i & Number of perfect hits \\
  {\tt H1} & i & Number of 1-difference hits (see also {\tt NM}) \\
  {\tt H2} & i & Number of 2-difference hits \\
  {\tt HI} & i & Query hit index \\
  {\tt IH} & i & Query hit total count \\
  {\tt LB} & Z & Library \\
  {\tt MC} & Z & CIGAR string for mate/next segment \\
  {\tt MD} & Z & String encoding mismatched and deleted reference bases \\
  {\tt MF} & ? & Reserved for backwards compatibility reasons \\
  {\tt MI} & Z & Molecular identifier; a string that uniquely identifies the molecule from which the record was derived \\
  {\tt ML} & B,C & Base modification probabilities \\
  {\tt MM} & Z & Base modifications / methylation  \\
  {\tt MN} & i & Length of sequence at the time {\tt MM} and {\tt ML} were produced \\
  {\tt MQ} & i & Mapping quality of the mate/next segment \\
  {\tt NH} & i & Number of reported alignments that contain the query in the current record \\
  {\tt NM} & i & Edit distance to the reference \\
  {\tt OA} & Z & Original alignment \\
  {\tt OC} & Z & Original CIGAR (deprecated; use {\tt OA} instead) \\
  {\tt OP} & i & Original mapping position (deprecated; use {\tt OA} instead) \\
  {\tt OQ} & Z & Original base quality \\
  {\tt OX} & Z & Original unique molecular barcode bases \\
  {\tt PG} & Z & Program \\
  {\tt PQ} & i & Phred likelihood of the template \\
  {\tt PT} & Z & Read annotations for parts of the padded read sequence \\
  {\tt PU} & Z & Platform unit \\
  {\tt Q2} & Z & Phred quality of the mate/next segment sequence in the {\tt R2} tag \\
  {\tt QT} & Z & Phred quality of the sample barcode sequence in the {\tt BC} tag \\
  {\tt QX} & Z & Quality score of the unique molecular identifier in the {\tt RX} tag \\
  {\tt R2} & Z & Sequence of the mate/next segment in the template \\
  {\tt RG} & Z & Read group \\
  {\tt RT} & ? & Reserved for backwards compatibility reasons \\
  {\tt RX} & Z & Sequence bases of the (possibly corrected) unique molecular identifier \\
  {\tt S2} & ? & Reserved for backwards compatibility reasons \\
  {\tt SA} & Z & Other canonical alignments in a chimeric alignment \\
  {\tt SM} & i & Template-independent mapping quality \\
  {\tt SQ} & ? & Reserved for backwards compatibility reasons \\
  {\tt TC} & i & The number of segments in the template \\
  {\tt TS} & A & Transcript strand \\
  {\tt U2} & Z & Phred probability of the 2nd call being wrong conditional on the best being wrong \\
  {\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\
  {\tt X?} & ? & Reserved for end users \\
  {\tt Y?} & ? & Reserved for end users \\
  {\tt Z?} & ? & Reserved for end users \\
  \hline
\end{longtable}
\end{center}

\subsection{Additional Template and Mapping data}

\begin{description}
\item[AM:i:\tagvalue{score}]
The smallest template-independent mapping quality of any segment in the same template as this read.
(See also {\tt SM}.)

\item[AS:i:\tagvalue{score}]
Alignment score generated by aligner.

\item[BQ:Z:\tagvalue{qualities}]
Offset to base alignment quality (BAQ), of the same length as the read sequence.
At the $i$-th read base, ${\rm BAQ}_i=Q_i-({\rm BQ}_i-64)$ where $Q_i$ is the $i$-th base quality.
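
As an illustrative worked example (the values are hypothetical, and ${\rm BQ}_i$ is taken here to be the ASCII code of the $i$-th {\tt BQ} character), a base with $Q_i=30$ whose {\tt BQ} character is `{\tt D}' (ASCII~68) has
\[ {\rm BAQ}_i = 30-(68-64) = 26, \]
while a {\tt BQ} character of `{\tt @}' (ASCII~64) leaves the quality unchanged, ${\rm BAQ}_i=Q_i$.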

\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if (and only if) it contains $>$65535 operations.
This tag is used only in BAM files, as a workaround for BAM's inability to store long CIGARs
in the standard way. SAM and CRAM files created with updated tools aware of the
workaround are not expected to contain this tag. See also the footnote in
Section 4.2 of the SAM spec for details.

\item[CP:i:\tagvalue{pos}]
Leftmost coordinate of the next hit.

\item[E2:Z:\tagvalue{bases}]
The 2nd most likely base calls. Same encoding and same length as {\sf SEQ}.
See also {\tt U2} for associated quality values.

\item[FI:i:\tagvalue{int}]
The index of the segment in the template.

\item[FS:Z:\tagvalue{str}]
Segment suffix.

\item[H0:i:\tagvalue{count}]
Number of perfect hits.

\item[H1:i:\tagvalue{count}]
Number of 1-difference hits (see also {\tt NM}).

\item[H2:i:\tagvalue{count}]
Number of 2-difference hits.

\item[HI:i:\emph{i}]
Query hit index, indicating the alignment record is the $i$-th one stored
in SAM.

\item[IH:i:\tagvalue{count}]
Number of alignments stored in the file that contain the query in the current
record.

\item[MC:Z:\tagvalue{cigar}]
CIGAR string for mate/next segment.

\item[MD:Z:\tagregex{[0-9]+(([A-Z]|\char92\char94[A-Z]+)[0-9]+)*}]
\hfill\\
String encoding mismatched and deleted reference bases, used in conjunction with the {\sf CIGAR} and {\sf SEQ} fields to reconstruct the bases of the reference sequence interval to which the alignment has been mapped.
This can enable variant calling without requiring access to the entire original reference.

The {\tt MD} string consists of the following items, concatenated without additional delimiter characters:
\begin{itemize}
\item \verb"[0-9]+", indicating a run of reference bases that are identical to the corresponding {\sf SEQ} bases;
\item \verb"[A-Z]", identifying a single reference base that differs from the {\sf SEQ} base aligned at that position;
\item \verb"\^[A-Z]+", identifying a run of reference bases that have been deleted in the alignment.
\end{itemize}

As shown in the complete regular expression above, numbers alternate with the other items.
Thus if two mismatches or deletions are adjacent without a run of identical bases between them, a `{\tt 0}' (indicating a 0-length run) must be used to separate them in the {\tt MD} string.

Clipping, padding, reference skips, and insertions (`{\tt H}', `{\tt S}', `{\tt P}', `{\tt N}', and `{\tt I}' {\sf CIGAR} operations) are not represented in the {\tt MD} string.
When reconstructing the reference sequence, inserted and soft-clipped {\sf SEQ} bases are omitted as determined by tracking `{\tt I}' and `{\tt S}' operations in the {\sf CIGAR} string.
(If the {\sf CIGAR} string contains `{\tt N}' operations, then the corresponding skipped parts of the reference sequence cannot be reconstructed.)

For example, the string `\verb"10A5^AC6"' means that,
starting from the leftmost reference base in the alignment, there are 10 matches
followed by an A on the reference that differs from the aligned read
base; the next 5 reference bases are matches, followed by a 2\,bp deletion from
the reference whose deleted sequence is AC; and the last 6~bases are matches.
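
As a further sketch of reference reconstruction (using a hypothetical record, not real data), consider {\sf SEQ} {\tt ACGTACGTA} with {\sf CIGAR} {\tt 3M1I2M2D3M} and {\tt MD:Z:2T2\char94GG3}:
\begin{verbatim}
CIGAR      3M   1I  2M  2D   3M
SEQ        ACG  T   AC  --   GTA
MD         2T       2   ^GG  3
reference  ACT      AC  GG   GTA
\end{verbatim}
The third {\sf SEQ} base (`{\tt G}') is replaced by the mismatched reference base `{\tt T}', the inserted `{\tt T}' is omitted, and the deleted bases `{\tt GG}' are restored, giving the reference interval {\tt ACTACGGGTA}.
Similarly, under the same assumptions, a {\tt 4M} alignment of {\sf SEQ} {\tt ACGT} to reference {\tt AGTT} has adjacent mismatches at the second and third bases and so would be written {\tt MD:Z:1G0T1}.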

\item[MQ:i:\tagvalue{score}]
Mapping quality of the mate/next segment.

\item[NH:i:\tagvalue{count}]
Number of reported alignments that contain the query in the current record.

\item[NM:i:\tagvalue{count}]
Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch.
Note this means that ambiguity codes in both sequence and reference that match each other, such as `{\tt N}' in both, or compatible codes such as `{\tt A}' and `{\tt R}', are still counted as mismatches.
The special sequence base `{\tt =}' will always be considered to be a match, even if the reference is ambiguous at that point.
Alignment reference skips, padding, soft and hard clipping (`{\tt N}', `{\tt P}', `{\tt S}' and `{\tt H}' {\sf CIGAR} operations) do not count as mismatches, but insertions and deletions count as one mismatch per base.

Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.
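
As a worked illustration under this definition (with hypothetical values), an alignment with {\sf CIGAR} {\tt 3M1I2M2D3M} containing exactly one mismatched base within its `{\tt M}' operations has
\[ {\tt NM} = 1 + 1 + 2 = 4, \]
counting one mismatch, one inserted base and two deleted bases.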

\item[PQ:i:\tagvalue{score}]
Phred likelihood of the template, conditional on the mapping locations of both/all segments being correct.

\item[Q2:Z:\tagvalue{qualities}]
Phred quality of the mate/next segment sequence in the {\tt R2} tag.
Same encoding as {\sf QUAL}.

\item[R2:Z:\tagvalue{bases}]
Sequence of the mate/next segment in the template.  See also {\tt Q2}
for any associated quality values.

\item[SA:Z:\tagregex{{\tt (}\emph{rname}{\tt ,}\emph{pos}{\tt ,}\emph{strand}{\tt ,}\emph{CIGAR}{\tt ,}\emph{mapQ}{\tt ,}\emph{NM}{\tt ;)}+}]
Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list.
Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line.
\emph{Strand} is either `{\tt +}' or `{\tt -}', indicating forward/reverse strand, corresponding to FLAG bit 0x10.
\emph{Pos} is a 1-based coordinate.
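
For instance, a hypothetical supplementary record whose primary alignment lies on the forward strand of a reference named {\tt chr1} might carry
\begin{verbatim}
SA:Z:chr1,1000,+,60M40S,60,0;
\end{verbatim}
where the reference name, coordinate, CIGAR and scores are invented purely for illustration.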

\item[SM:i:\tagvalue{score}]
Template-independent mapping quality, i.e., the mapping quality if the read were mapped as a single read rather than as part of a read pair or template.

\item[TC:i:\tagvalue{}]
The number of segments in the template.

\item[TS:A:\tagvalue{strand}]
Strand (`{\tt +}' or `{\tt -}') of the transcript to which the read has been mapped.

\item[U2:Z:\tagvalue{}]
Phred probability of the 2nd call being wrong conditional on the best being wrong.
The same encoding and length as {\sf QUAL}.  See also {\tt E2} for associated base calls.

\item[UQ:i:\tagvalue{}]
Phred likelihood of the segment, conditional on the mapping being correct.
\end{description}

\subsection{Metadata}

\begin{description}
\item[RG:Z:\tagvalue{readgroup}]
The read group to which the read belongs.
If {\tt @RG} headers are present, then \emph{readgroup} must match the
{\tt RG-ID} field of one of the headers.

\item[LB:Z:\tagvalue{library}]
The library from which the read has been sequenced.
If {\tt @RG} headers are present, then \emph{library} must match the
{\tt RG-LB} field of one of the headers.

\item[PG:Z:\tagvalue{program\_id}]
Program. Value matches the header {\tt PG-ID} tag if {\tt @PG} is present.

\item[PU:Z:\tagvalue{platformunit}]
The platform unit in which the read was sequenced.
If {\tt @RG} headers are present, then \emph{platformunit} must match the
{\tt RG-PU} field of one of the headers.

\item[CO:Z:\tagvalue{text}]
Free-text comments.
\end{description}

\subsection{Barcodes}
DNA barcodes can be used to identify the provenance of the underlying reads.
There are currently three varieties of barcodes that may co-exist: Sample Barcode, Cell Barcode, and Unique Molecular Identifier (UMI).
\begin{itemize}
\item
Despite its name, the \emph{Sample Barcode} identifies the \emph{Library} and allows multiple libraries to be combined and sequenced together.
After sequencing, the reads can be separated according to this barcode and placed in different ``read groups'' each corresponding to a library.
Since the library was generated from a sample, knowing the library identifies the sample.
The barcode itself can be included in the {\tt PU} field in the {\tt RG} header line.
Since the {\tt PU} field should be globally unique, it is advisable to include specific information such as flowcell barcode and lane.
It is not recommended to use the barcode as the {\tt ID} field of the {\tt RG} header line, as some tools modify this field (e.g., when merging files).
\item
The \emph{Cell Barcode} is similar to the sample barcode but there is (normally) no control over the assignment of cells to barcodes (whose sequence could be random or predetermined).
The Cell Barcode can help identify when reads come from different cells in a ``single-cell'' sequencing experiment.

\item
The \emph{UMI} is intended to identify the (single- or double-stranded) molecule at the time that the barcode was introduced.
This can be used to inform duplicate marking and to perform consensus calling in ultra-deep sequencing.
Additionally, the UMI can be used to (informatically) link reads that were generated from the same long molecule, enabling long-range phasing and better informed mapping.
In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes.
These templates can also be considered duplicates even though technically they may have different UMIs.
Multiple UMIs can be added by a protocol, possibly at different time-points, which means that specific knowledge of the protocol may be needed in order to analyze the resulting data correctly.
\end{itemize}

\begin{description}
\item[BC:Z:\tagvalue{sequence}]
Barcode sequence (identifying the sample/library), with any quality scores (optionally) stored in the {\tt QT} tag.
The {\tt BC} tag should match the {\tt QT} tag in length. 
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (`{\tt -}') between the barcodes from the same template. 

\item[QT:Z:\tagvalue{qualities}] 
Phred quality of the sample barcode sequence in the {\tt BC} tag.
Same encoding as {\sf QUAL}, i.e., Phred score + 33.
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (`{\tt \textvisiblespace}') between the different strings from the same template. 

\item[CB:Z:\tagvalue{str}]
Cell identifier, consisting of the optionally-corrected cellular barcode sequence and an optional suffix.
The sequence part is similar to the {\tt CR} tag, but may have had sequencing errors etc.\ corrected.
This may be followed by a suffix consisting of a hyphen (`{\tt -}') and one or more alphanumeric characters to form an identifier.
In the case of the cellular barcode ({\tt CR}) being based on multiple barcode sequences the recommended implementation concatenates all the (corrected or uncorrected) barcodes with a hyphen (`{\tt -}') between the different barcodes.
Sequencing errors etc.\ aside, all reads from a single cell are expected to have the same {\tt CB} tag.

\item[CR:Z:\tagvalue{sequence+}]
Cellular barcode. The uncorrected sequence bases of the cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in {\tt CY}.
Sequencing errors etc.\ aside, all reads with the same {\tt CR} tag likely derive from the same cell.
In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.

\item[CY:Z:\tagvalue{qualities+}]
Phred quality of the cellular barcode sequence in the {\tt CR} tag.
Same encoding as {\sf QUAL}, i.e., Phred score + 33.
The lengths of the {\tt CY} and {\tt CR} tags must match.
In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the quality strings with spaces (`{\tt \textvisiblespace}') between the different strings.

\item[MI:Z:\tagvalue{str}]
Molecular Identifier. 
A unique ID within the SAM file for the source molecule from which this read is derived. 
All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule. 

\item[OX:Z:\tagvalue{sequence+}] 
Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the {\tt BZ} tag. 
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.

\item[BZ:Z:\tagvalue{qualities+}] 
Phred quality of the (uncorrected) unique molecular identifier sequence in the {\tt OX} tag.
Same encoding as {\sf QUAL}, i.e., Phred score + 33.
The {\tt OX} tag should match the {\tt BZ} tag in length.
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings.

\item[RX:Z:\tagvalue{sequence+}]
Sequence bases from the unique molecular identifier.
These could be either corrected or uncorrected. Unlike {\tt MI}, the value may be non-unique in the file.
The value should consist of a sequence of bases.
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.

If the bases represent corrected bases, the original sequence can be stored in {\tt OX} (similar to {\tt OQ} storing the original qualities of bases).

\item[QX:Z:\tagvalue{qualities+}]
Phred quality of the unique molecular identifier sequence in the {\tt RX} tag.
Same encoding as {\sf QUAL}, i.e., Phred score + 33.
The qualities here may have been corrected (raw bases and qualities can be stored in {\tt OX} and {\tt BZ} respectively).
The lengths of the {\tt QX} and the {\tt RX} tags must match.
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings.
\end{description}
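
\noindent
As an illustration of these conventions (with invented sequences and qualities), a template carrying two 4-base UMIs could be annotated as
\begin{verbatim}
RX:Z:ACGT-TTGA
QX:Z:IIII FFFF
\end{verbatim}
so that the hyphen in {\tt RX} and the space in {\tt QX} occupy the same position and the two values have equal length.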

\subsection{Original data}

\begin{description}
\item[OA:Z:\tagregex{(\metavar{RNAME},\metavar{POS},\metavar{strand},\metavar{CIGAR},\metavar{MAPQ},\metavar{NM};)+}]
The original alignment information of the record prior to realignment or unalignment by a subsequent tool.
Each original alignment entry contains the following six field values from the original record, generally in their textual SAM representations, separated by commas (`{\tt ,}') and terminated by a semicolon (`{\tt ;}'):
{\sf RNAME}, which must be explicit (unlike {\sf RNEXT}, `{\tt =}' may not be used here);
1-based {\sf POS};
`{\tt +}' or `{\tt -}', indicating forward/reverse strand respectively (as per bit~0x10 of {\sf FLAG});
{\sf CIGAR};
{\sf MAPQ};
{\tt NM} tag value, which may be omitted (though the preceding comma must be retained).

In the presence of an existing {\tt OA} tag, a subsequent tool may append another original alignment entry after the semicolon,
adding to---rather than replacing---the existing {\tt OA} information.
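
As a sketch with invented values, a record realigned once, with the original {\tt NM} value omitted, might carry
\begin{verbatim}
OA:Z:chr1,100,+,50M,60,;
\end{verbatim}
and, after a second realignment by a tool that appends to the existing tag,
\begin{verbatim}
OA:Z:chr1,100,+,50M,60,;chr1,102,+,2S48M,58,1;
\end{verbatim}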

The {\tt OA} field is designed to provide record-level information that can be useful for understanding the provenance of the information in a record.
It is not designed to provide a complete history of the template alignment information.
In particular, realignments resulting in the removal of Secondary or Supplementary records will cause the loss of all tags associated with those records, and may also leave the {\tt SA} tag in an invalid state.

\item[OC:Z:\tagvalue{cigar}]
Original CIGAR, usually before realignment.
Deprecated in favour of the more general {\tt OA}.

\item[OP:i:\tagvalue{pos}]
Original 1-based {\sf POS}, usually before realignment.
Deprecated in favour of the more general {\tt OA}.

\item[OQ:Z:\tagvalue{qualities}]
Original base quality, usually before recalibration.
Same encoding as {\sf QUAL}.
\end{description}

\subsection{Annotation and Padding}

The SAM format can be used to represent \emph{de novo} assemblies, generally by using padded reference sequences and the annotation tags described here.
See the \emph{Guide for Describing Assembly Sequences} in the \href{http://samtools.github.io/hts-specs/SAMv1.pdf}{\emph{SAM Format Specification}} for full details of this representation.

\begin{description}
\item[CT:Z:\tagregex{\metavar{strand};\metavar{type}(;\metavar{key}(=\metavar{value})?)*}]
\hfill\\
Complete read annotation tag, used for consensus annotation dummy features.

The {\tt CT} tag is intended primarily for annotation
dummy reads, and consists of a \emph{strand}, \emph{type} and zero or
more \emph{key}=\emph{value} pairs, each separated with semicolons.
The \emph{strand} field has four values as in GFF3,\footnote{The
Generic Feature Format version 3 (GFF3) specification can be found at
\href{http://www.sequenceontology.org/}{\tt http://sequenceontology.org}.}
and supplements FLAG
bit 0x10 to allow unstranded (`{\tt .}'), and stranded but unknown strand
(`{\tt ?}') annotation. For these and annotation on the forward strand
(\emph{strand} set to `{\tt +}'), do not set FLAG bit 0x10. For
annotation on the reverse strand, set the \emph{strand} to `{\tt -}'
and set FLAG bit 0x10.

The \emph{type} and any \emph{keys} and their
optional \emph{values} are all percent encoded according to
RFC3986 to escape meta-characters `{\tt =}', `{\tt \%}', `{\tt ;}',
`{\tt |}' or non-printable characters not matched by the isprint()
macro (with the C locale). For example a percent sign becomes
`{\tt \%25}'.
%NOTE - This leaves open the possibility of allowing multiple such
%entries for a single CT tag to be combined with | as in the PT tag.
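
For example, assuming a hypothetical unstranded annotation of type {\tt repeat} with key {\tt note} and value {\tt a;b}, the semicolon within the value is escaped as {\tt \%3B}:
\begin{verbatim}
CT:Z:.;repeat;note=a%3Bb
\end{verbatim}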

\item[PT:Z:\tagregex{\metavar{annotag}(\char92|\metavar{annotag})*}]\enskip where each \metavar{annotag} matches\quad\tagregex{\metavar{start};\metavar{end};\metavar{strand};\metavar{type}(;\metavar{key}(=\metavar{value})?)*}
\hfill\\
Read annotations for parts of the padded read sequence.

The {\tt PT} tag value has the format of a series of annotation
tags separated by `{\tt |}', each annotating a sub-region of the read.
Each tag consists of \emph{start}, \emph{end}, \emph{strand},
\emph{type} and zero or more \emph{key}{\tt =}\emph{value} pairs, each
separated with semicolons. \emph{Start} and \emph{end} are 1-based
positions between one and the sum of the \cigarops{MIDPS=X}
{\sf CIGAR} operators, i.e., {\sf SEQ} length plus any pads.  Note
any editing of the CIGAR string may require updating the {\tt PT}
tag coordinates, or even invalidate them.
As in GFF3, \emph{strand} is one of `{\tt +}' for forward strand tags,
`{\tt -}' for reverse strand, `{\tt .}' for unstranded or `{\tt ?}'
for stranded but unknown strand.

The \emph{type} and any \emph{keys} and their optional \emph{values}
are all percent encoded as in the {\tt CT} tag.
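
As a sketch with invented coordinates and types, a read with two annotated sub-regions might carry
\begin{verbatim}
PT:Z:1;20;+;primer|35;60;.;repeat;name=Alu
\end{verbatim}
annotating padded positions 1 to 20 as a forward-strand {\tt primer} and positions 35 to 60 as an unstranded {\tt repeat} with the key/value pair {\tt name=Alu}.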
\end{description}

\subsection{Technology-specific data}

\begin{description}
\item[FZ:B:S,\tagvalue{intensities}]
Flow signal intensities on the original strand of the read, stored as {\tt (uint16\_t) round(value * 100.0)}.
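
For example, under this encoding a (hypothetical) flow signal of 1.03 would be stored as 103 and a signal of 0.07 as 7; the largest representable signal is $65535/100 = 655.35$.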
\end{description}

\subsubsection{Color space}

% TODO Describe color space and the encoding here.

\begin{description}
\item[CM:i:\tagvalue{distance}]
Edit distance between the color sequence and the color reference (see also {\tt NM}).

\item[CS:Z:\tagvalue{sequence}]
Color read sequence on the original strand of the read. The primer base must be included.

\item[CQ:Z:\tagvalue{qualities}]
Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}.
\end{description}

\subsection{Base modifications}

Base modifications, including base methylation, are represented as a series of edits from the primary unmodified sequence as originally reported by the sequencing instrument.
This potentially differs from the sequence stored in the main SAM {\sf SEQ} field if the latter has been reverse complemented, in which case SAM {\sf FLAG} 0x10 must be set.
This means modification positions are also recorded against the original orientation (i.e., starting at the 5' end), and count the original base types.

Each modified base prediction listed also has a quality value associated with it.
Given the unmodified base already has a phred likelihood, this base modification quality should be interpreted as the likelihood of this modification being correct given an assumption the original call is correct.

\begin{description}
\item[MM:Z:\tagregex{([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*}]
\hfill\\
The first character is the unmodified ``fundamental'' base as reported
by the sequencing instrument for the top strand.
It must be one of `{\tt A}', `{\tt C}', `{\tt G}', `{\tt T}', `{\tt U}' (if RNA) or `{\tt N}' for anything else, including any IUPAC ambiguity codes in the reported SEQ field.
Note `{\tt N}' may be used to match any base rather than specifically an `{\tt N}' call by the sequencing instrument.
This may be used in situations where the base modification is not a derivation of a standard base type.
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt MM} and {\tt ML} tags.} and one or more base modification codes.

Following the base modification codes is a recommended but optional `{\tt .}' or `{\tt ?}' describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
When this flag is `{\tt ?}' there is no information about the modification status of the skipped bases provided.
When this flag is not present, or it is `{\tt .}', these bases should be assumed to have low probability of modification.\footnote{The decision whether a base is assumed to be unmodified or has a probability explicitly provided is up to the modification calling program. Some programs will elide calls with modification probabilities below a threshold to provide a more compact modification tag.}

This is then followed by a comma separated list of how many seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
This number series is comparable to the numbers in an {\tt MD} tag,
albeit counting specific base types only and potentially reverse-complemented.

For example `{\tt C+m,5,12,0;}' tells us there are three
potential 5-Methylcytosine bases on the top strand of {\sf SEQ}.
The first 5 `{\tt C}' bases are skipped and assumed to be unmodified, and the 6th, 19th and 20th have modification status indicated by the corresponding probabilities in the {\tt ML} tag.
The 12 cytosines between the 6th and 19th cytosines are likewise skipped and assumed to be unmodified.
No modification probabilities are provided for these 17 skipped cytosines.

When the `{\tt ?}' flag is present the tag `{\tt C+m?,5,12,0;}' tells us the modification status of the first five 
cytosine bases is unknown, the sixth cytosine is called (as either modified or unmodified), followed by 12 more unknown cytosines, and the 19th and 20th are called.

Similarly `{\tt G-m,14;}' indicates that at the 15th `{\tt G}' there might be a 5-Methylcytosine on the opposite strand (still counting using the top strand base calls from the 5' end).
When the alignment record is reverse complemented (SAM flag 0x10) these two examples do not change since the tag always refers to the as-sequenced orientation.
See the test/SAMtags/MM-orient.sam file for examples.
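
As a small worked illustration (using an invented sequence), suppose the as-sequenced {\sf SEQ} is {\tt ACCGCTCA}, whose cytosines lie at positions 2, 3, 5 and~7.
The tag {\tt MM:Z:C+m,2,0;} then skips the two cytosines at positions 2 and~3, reports a modification call for the third cytosine (position~5), skips no further cytosines, and reports a call for the fourth cytosine (position~7).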

This permits modifications to be listed on either strand with the rare potential for both strands to have a modification at the same site.
If SAM FLAG 0x10 is set, indicating that SEQ has been reverse complemented from the sequence observed by the sequencing machine, note that these base modification field values will be in the opposite orientation to SEQ and other derived SAM fields.

Note it is permitted for the coordinate list to be empty (for example `{\tt MM:Z:C+m;}'), which may be used as an explicit indicator that this base modification is not present.
It is not permitted for coordinates to be beyond the length of the sequence.

When multiple modifications are listed, for example `{\tt C+mh,5,12,0;}', it indicates the modification may be any of the stated modification types.
The associated confidence values in the {\tt ML} tag may be used to determine the relative likelihoods between the options.
The example above is equivalent to `{\tt C+m,5,12,0;C+h,5,12,0;}', although this will have a different ordering of confidence values in {\tt ML}.
Note ChEBI codes cannot be used in the multi-modification form (such as the `{\tt C+mh}' example above).

If the modification is not one of the standard common types (listed below) it can be specified as a numeric ChEBI code.
For example `{\tt C+76792,57;}' is the same as `{\tt C+h,57;}'.

An unmodified base of `{\tt N}' means count any base in {\sf SEQ}, not only those of `{\tt N}'.
Thus `{\tt N+n,100;}' means the 101st base is Xanthosine (n), irrespective of the sequence composition.

The standard code types and their associated ChEBI values are listed
below, taken from Viner {\it et al.}%
\footnote{Coby Viner {\it et al.}, \emph{Modeling methyl-sensitive
transcription factor motifs with an expanded epigenetic alphabet}, \url{https://www.biorxiv.org/content/10.1101/043794v1}.}
Additionally, ambiguity codes `{\tt A}', `{\tt C}', `{\tt G}', `{\tt T}' and `{\tt U}'
exist to represent unspecified modifications of their respective
canonical base types, plus code `{\tt N}' to represent an unspecified
modification of any base type.

\begin{center}
\begin{tabular}{lllll}
{\bf Unmodified base} & {\bf Code} & {\bf Abbreviation} & {\bf Name} & {\bf ChEBI} \\
\hline
C & m & 5mC   & 5-Methylcytosine        & 27551 \\
C & h & 5hmC  & 5-Hydroxymethylcytosine & 76792 \\
C & f & 5fC   & 5-Formylcytosine        & 76794 \\
C & c & 5caC  & 5-Carboxylcytosine      & 76793 \\
C & C &       & Ambiguity code; any C mod & \\
\hline
T & g & 5hmU  & 5-Hydroxymethyluracil   & 16964 \\
T & e & 5fU   & 5-Formyluracil          & 80961 \\
T & b & 5caU  & 5-Carboxyluracil        & 17477 \\
T & T &       & Ambiguity code; any T mod & \\
\hline
U & U &       & Ambiguity code; any U mod & \\
\hline
A & a & 6mA   & 6-Methyladenine         & 28871 \\
A & A &       & Ambiguity code; any A mod & \\
\hline
G & o & 8oxoG & 8-Oxoguanine            & 44605 \\
G & G &       & Ambiguity code; any G mod & \\
\hline
N & n & Xao   & Xanthosine              & 18107 \\
N & N &       & Ambiguity code; any mod & \\
\end{tabular}
\end{center}
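
Since a numeric ChEBI code and its single-letter code are interchangeable, software may wish to normalise one to the other.
The following Python sketch shows one way to do so; the names {\tt CHEBI\_TO\_CODE} and {\tt canonical\_code} are hypothetical, and the mapping simply transcribes the table above.

\begin{verbatim}
```python
# ChEBI value -> single-letter code, transcribed from the table above.
CHEBI_TO_CODE = {
    27551: "m", 76792: "h", 76794: "f", 76793: "c",
    16964: "g", 80961: "e", 17477: "b",
    28871: "a", 44605: "o", 18107: "n",
}


def canonical_code(code):
    """Map a numeric ChEBI code to its letter code when one is listed."""
    if code.isdigit():
        # ChEBI values not in the table are returned unchanged.
        return CHEBI_TO_CODE.get(int(code), code)
    return code
```
\end{verbatim}

Under this sketch `{\tt C+76792,57;}' and `{\tt C+h,57;}' both normalise to the code {\tt h}.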

% MP was the former quality score for MM.  However being Phred scores
% it can only reasonable record probabilities for highly likely
% events, making it inappropriate for callers (eg ONT's) that wish to
% jointly call probabilities for the entire trained set of
% possibilities.  We could use log-odds, similar to how early Illumina
% runs did to record likelihoods for A, C, G and T irrespective of
% call, but for now we're using linear-scaled probabilities.  These
% are in the ML tag.
%
% The MP tag is left here for now as the jury is still out on whether
% we'll need it in the future.
%
% \item[MP:Z:\tagvalue{qualities}]
% \hfill\\
% The optional {\tt MP} tag lists the Phred qualities of each modification listed in the {\tt MM} tag in the order they occur.
% The qualities are encoded in the same manner as the primary {\sf QUAL} field; one byte per quality with ASCII value Phred score + 33.
% A space character (`{\tt \textvisiblespace}') should be used as a separator between concatenated quality strings when multiple modification lists are present in the {\tt MM} tag.
% The length should match the number of position deltas from {\tt MM} plus 1 per space character required.
%
% For example ``{\tt MM:Z:C+m,5,12,3;C+h,57;}'' may have an associated
% quality tag of ``{\tt MP:Z:5EB /}''.
%
% Where multiple modification types are listed together, such as in ``{\tt MM:Z:C+mh,5,12,3;}'' the quality values are interleaved in order ({\tt m} at 6, {\tt h} at 6, {\tt m} at 19, {\tt h} at 19 and so on), giving 6 quality values in total for this example.
%
% Quality values for ambiguity codes give the likelihood that the
% modification is one of the possible codes compatible with that
% ambiguity code.  For example {\tt MM:Z:C+C,10; MP:Z:+} indicates a C
% call with an unspecified modification and the phred score of 10 (ASCII
% value {\tt +}).  This corresponds to a 90\% chance of the base being
% modified.
%
% To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility.
% The values used should be absolute probabilities, not relative between the alternatives.
% For example, a C base that has 95\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$ giving phred score 5, ASCII value {\tt \&})and 5hmC with 22.5\% probability ($0.9 * 0.25$ giving phred score 1, ASCII value {\tt "}).
% This could be represented with ``{\tt MM:Z:C+m,10;C+h,10; MP:Z:" \&}''.

\item[ML:B:C,\tagvalue{scaled-probabilities}]
\hfill\\
The optional {\tt ML} tag lists the probability of each modification listed in the {\tt MM} tag being correct, in the order that they occur.
The continuous probability range 0.0 to 1.0 is remapped in equal-sized
portions to the discrete integers 0 to 255 inclusive. Thus the
probability range corresponding to integer value $N$ is $N/256$ to
$(N+1)/256$.
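
As a worked illustration of this mapping (a sketch that follows from the ranges above; placing $p = 1.0$ in the top portion is an assumption), a probability $p$ is encoded as
\[
N = \min\bigl(\lfloor 256\,p \rfloor,\ 255\bigr),
\]
so that $p = 0.8$ gives $N = \lfloor 204.8 \rfloor = 204$, whose range runs from $204/256 = 0.796875$ to $205/256 \approx 0.8008$.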

The SAM encoding therefore uses a byte array of type `{\tt C}' whose number of elements equals the total number of modifications listed as present in the {\tt MM} tag, counting each modification type in a multi-modification entry separately since each has its own probability.

For example `{\tt MM:Z:C+m,5,12;C+h,5,12;}' may have an associated tag of `{\tt ML:B:C,204,89,26,130}'.

If the above is rewritten in the multiple-modification form, the probabilities are interleaved in the order presented, giving `{\tt MM:Z:C+mh,5,12;  ML:B:C,204,26,89,130}'.
Note that where several possible modifications are presented at the same site, the {\tt ML} values represent the absolute probabilities of each modification call being correct, not the relative likelihoods between the alternatives.
These probabilities should not sum to more than 1.0 ($\approx 256$ in integer encoding, allowing for some minor rounding errors), but may sum to a lower total, with the remainder representing the probability that none of the listed modification types is present.
In the example used above, the 6th {\tt C} has an 80\% chance of being {\tt 5mC}, a 10\% chance of being {\tt 5hmC} and a 10\% chance of being an unmodified {\tt C}.
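
The interleaving and the remainder calculation can be sketched in Python as follows.
The function names are hypothetical, and reading each value $N$ as the probability $N/256$ (the lower edge of its range) is an assumption made to keep the arithmetic simple.

\begin{verbatim}
```python
def interleave_ml(per_code):
    """Merge per-modification ML lists into multi-modification order.

    per_code holds one list of values per modification code, each with
    one value per site, e.g. [[204, 89], [26, 130]] for codes m and h.
    """
    if len({len(values) for values in per_code}) > 1:
        raise ValueError("each modification needs one value per site")
    return [value for site in zip(*per_code) for value in site]


def remainder_probability(site_values):
    """Approximate probability that no listed modification is present."""
    # Assumption: each value N is read as N/256; clamp rounding overshoot.
    return max(0.0, 1.0 - sum(site_values) / 256)
```
\end{verbatim}

Applied to the example above, the values for {\tt m} (204, 89) and {\tt h} (26, 130) interleave to 204, 26, 89, 130, and the first site leaves a remainder of $26/256$, roughly 10\%.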

{\tt ML} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
For example {\tt MM:Z:C+C,10; ML:B:C,229} indicates a C call with a probability of 90\% of having some form of unspecified modification.

\item[MN:i:\tagvalue{length}]
\hfill\\
The length of the {\sf SEQ} field at the time the {\tt MM} value was last written.

Some processing of aligned data, such as the use of hard-clipping tools, may alter {\sf SEQ} sequence data.
If the sequence is shortened in this manner then the base offsets in {\tt MM} and {\tt ML} become invalid unless they are also updated accordingly.

Some hard-clipping tools update {\tt MM}/{\tt ML} but others do not, so the {\tt MN} tag offers a simple sanity check.
Software that wishes to validate {\tt MM} should compare the length of the {\sf SEQ} field with the contents of the {\tt MN} tag---if they differ, the {\tt MM}~and {\tt ML}~values should be considered out-of-date.
The tag is optional, but recommended, and if it is absent then there is an implicit assumption that the {\tt MM} data is valid unless evidence implies otherwise (e.g., by having coordinates beyond the end of the sequence).
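
The check described above amounts to a single comparison, sketched here in Python.
The function name {\tt mm\_looks\_current} is hypothetical, and {\tt mn} is taken to be {\tt None} when the record has no {\tt MN} tag.

\begin{verbatim}
```python
def mm_looks_current(seq, mn=None):
    """Return False when MN shows that SEQ changed after MM was written."""
    if mn is None:
        # MN is optional; without it the MM data is assumed to be valid.
        return True
    return len(seq) == mn
```
\end{verbatim}

A result of {\tt True} does not prove that the {\tt MM} data is valid; it only means that the length check found no evidence to the contrary.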

\end{description}

\section{Draft tags}

These are tags which have been proposed and are broadly accepted to
become standard tags, but a review or probationary period has been
deemed useful.  They use the locally-defined tag namespace and
processing software should consider that the tags may have local usage
for other purposes.

\vspace*{1em}
There are currently no tags with draft status.

% \begin{center}\small
% % This table is sorted alphabetically
% \begin{longtable}{ccp{12.5cm}}
%   \hline
%   {\bf Tag} & {\bf Type} & {\bf Description} \\
%   {\tt MP} & Z & Base modification qualities \\
%   \hline
%   \endhead
% \end{longtable}
% \end{center}


\section{Locally-defined tags}

You can freely add new tags.
Note that tags starting with `{\tt X}', `{\tt Y}', or `{\tt Z}' and tags
containing lowercase letters in either position are reserved for local use
and will not be formally defined in any future version of this specification.
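
The reservation rule can be expressed as a short Python predicate.
This is a sketch only; the name {\tt is\_local\_tag} is hypothetical and the function assumes a syntactically valid two-character tag.

\begin{verbatim}
```python
def is_local_tag(tag):
    """Return True if a two-character tag is reserved for local use."""
    if len(tag) != 2:
        raise ValueError("tags are exactly two characters long")
    # Reserved: a leading X, Y or Z, or a lowercase letter in either position.
    return tag[0] in "XYZ" or any(ch.islower() for ch in tag)
```
\end{verbatim}

For instance, {\tt XS} and {\tt Mm} are reserved for local use under this rule, whereas {\tt MM} is not.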

If a new tag may be of general interest, it may be useful to have it added
to this specification.  Additions can be proposed by opening a new issue at
\url{https://github.com/samtools/hts-specs/issues} and/or by sending email
to \mailtourl{samtools-devel@lists.sourceforge.net}.

\begin{appendices}
\appendix
\section{Tag History}

This appendix lists when standard tags were initially defined or significantly changed, and other historical events that affect how tags are interpreted or what files they may appear in.

\setlength{\parindent}{0pt}
\newcommand*{\gap}{\vspace*{2ex}}

\subsubsection*{September 2024}

Added the MN tag for validating base modification tag consistency.

\subsubsection*{February 2022}

Base modification tags changed to use the predefined standard names MM and~ML, as their review period has finished.
Programs outputting the draft Mm and~Ml tags should be changed to use MM and~ML instead.

\subsubsection*{December 2021}

Amended draft Mm tag to provide hints about the modification status of skipped sequence bases.

\subsubsection*{July 2021}
Added the Mm and Ml draft tags describing base modifications.

\subsubsection*{March 2020}

Transcript strand tag TS added, equivalent to the locally-defined XS tag
produced by several RNA aligners.

\subsubsection*{January 2019}
Added the OA tag for recording original/previous alignment information.

Deprecated the OC and OP tags.

\subsubsection*{July 2018}

Clarified the calculation of NM score.

\subsubsection*{May 2018}

Cellular barcode tags CB, CR, and CY added.

Removed the RT:Z tag, which was a long-deprecated synonym for BC.

\subsubsection*{November 2017}

SAM version number {\tt VN:1.6} introduced, indicating the addition of the CG tag representation of very long CIGAR strings.
Files that contain records with more than 65,535 CIGAR operators should not declare a version number lower than~1.6 in their {\tt @HD} headers.
% Technically only BAM files containing records with CG tags need to avoid
% declaring VN<1.6, but recommending that SAM and CRAM files with long CIGAR
% strings also declare VN:1.6+ aids file format conversion.

\subsubsection*{August 2017}

Unique molecular identifier tags BZ, MI, OX, QX, and RX added.

Usage of sample barcode tag BC clarified.

\subsubsection*{June 2017}

Corrected the description of the E2 (second-most-likely bases) tag, which was previously unclear as to whether it contains bases or base qualities.

\subsubsection*{September 2016}

Predefined tags, previously listed as a brief table within the main SAM specification, have been split out into this new document.
There is now space for clearer and more complete tag descriptions.

\subsubsection*{February 2014}

MC tag added.

\subsubsection*{May 2013}

SAM version number {\tt VN:1.5} introduced, with limited impact for tags other than indicating that the CT/PT annotation tag definitions are considered finalised.

\gap
SA tag added.

\subsubsection*{March 2012}

Descriptions of CT and PT annotation tags significantly clarified.

\subsubsection*{October 2011}

Sample barcode tags QT and RT added, with RT being identified as a deprecated alternative to BC.

% These were actually added in late September as RT/PT, but RT was changed to
% CT (see samtools-devel, "Potential clash of RT tags (annotation vs barcode)",
% October 2011) before read-annotation-RT appeared in the wild.
Read annotation tags CT and PT added.

\subsubsection*{September 2011}

% This was actually August 29th, but let's call it September.
FZ tag's type changed from {\tt H} to {\tt B,S}-array.

BC and CO tags added.

\subsubsection*{April 2011}

SAM version number {\tt VN:1.4} introduced, indicating the addition of the {\tt B}-array tag type.
Files that contain records with {\tt B}-array fields should not declare a version number lower than~1.4 in their {\tt @HD} headers.

\gap
FZ tag added, with type {\tt H}.

MD tag description changed to allow IUPAC ambiguity codes in addition to {\tt ACGTN}.

\subsubsection*{March 2011}

CC and CP tags reinstated with their original meanings.

\subsubsection*{November 2010}

BQ tag added.

\subsubsection*{July 2010}

The specification was rewritten as a \LaTeX\ document specifying SAM version number {\tt VN:1.3}.

\gap
Tags FI, FS, OC, OP, OQ, and TC added.

Tags GC:Z, GQ:Z, and GS:Z, briefly proposed for representing repeatedly-sequenced reads, noted as reserved for backwards compatibility.
Existing tags MF:i (MAQ pair flag), SQ:H (suboptimal bases), and S2:H (mate's suboptimal bases) removed and noted as reserved for backwards compatibility.

CC and CP tags temporarily removed.

\subsubsection*{July 2009}

\begin{samepage}
The original SAM ``0.1.2-draft'' specification specified version number {\tt VN:1.0} and defined a total of thirty standard tags (though SQ and S2 were already deprecated in favour of E2 and U2):

\begin{center}
\begin{tabular}{l*{9}{@{\qquad}l}}
AM & CM & CS & H1 & IH & MF & NM & PU & RG & SQ \\
AS & CP & E2 & H2 & LB & MQ & PG & Q2 & S2 & U2 \\
CC & CQ & H0 & HI & MD & NH & PQ & R2 & SM & UQ
\end{tabular}
\end{center}
\end{samepage}

\end{appendices}

\end{document}