Clarify description of structured vs unstructured meta-information lines #620
Changes from all commits
cf4e7ac
6f8b89a
a8e8c2a
2da6d4e
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -100,24 +100,38 @@ \subsection{Data types} | |
For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}. | ||
|
||
\subsection{Meta-information lines} | ||
File meta-information is included after the \#\# string and must be key=value pairs. | ||
Meta-information lines are optional, but if they are present then they must be completely well-formed. | ||
Note that BCF, the binary counterpart of VCF, requires that all entries are present. | ||
It is recommended to include meta-information lines describing the entries used in the body of the VCF file. | ||
File meta-information lines start with ``\verb|##|'' and must appear first in the VCF file, before the header line (section~\ref{header-line}) and data record lines (section~\ref{data-lines}). | ||
They may be either \emph{unstructured} or \emph{structured}. | ||
|
||
An \emph{unstructured} meta-information line consists of a~\emph{key} (denoting the type of meta-information recorded) and a~\emph{value} (which may not be empty and must not start with a `\verb|<|' character), separated by an `\verb|=|' character: | ||
\begin{quote} | ||
\verb|##|\emph{key}\verb|=|\emph{value} | ||
\end{quote} | ||
Several unstructured meta-information lines are defined in this specification, notably \verb|##fileformat|. | ||
Others not defined by this specification, e.g.\ \verb|##fileDate| and \verb|##source|, are commonly found in VCF files. | ||
These typically have meanings that are obvious, or they are immaterial for processing the file, or both. | ||
|
||
All structured lines that have their value enclosed within ``$<>$'' require an ID which must be unique within their type. | ||
For all of the structured lines (\#\#INFO, \#\#FORMAT, \#\#FILTER, etc.), extra fields can be included after the default fields. | ||
A \emph{structured} meta-information line is similar, but the value is itself a comma-separated list of key=value pairs, enclosed within `\verb|<|' and `\verb|>|' characters: | ||
\begin{quote} | ||
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>| | ||
\end{quote} | ||
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix. | ||
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields. | ||
For example: | ||
\begin{verbatim} | ||
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128"> | ||
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="128"> | ||
I appreciate this hasn't changed and has come from the earlier version, but given it's an example I think it would be clearer with actual examples. Especially as "number" isn't a valid number, but we're using the verbatim style text rather than italic to indicate it's a placeholder term. Eg:
|
||
\end{verbatim} | ||
In the above example, the extra fields of ``Source'' and ``Version'' are provided. | ||
Optional fields must be stored as strings even for numeric values. | ||
The values of optional fields must be written as quoted strings, even for numeric values. | ||
Other structured lines not defined by this specification may also be used; the only default field for such lines is the required \verb|ID| field. | ||
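The distinction between unstructured and structured lines described above can be illustrated with a small parser. This is a minimal sketch in Python, not part of the PR or the specification: the function name `parse_meta_line`, the regular expression, and the sample values (`DP`, `dbSNP`) are assumptions chosen for illustration, and details such as backslash escapes inside quoted strings are assumed rather than taken from the text shown here.

```python
import re

# One key=value field inside <...>: the value is either a quoted string
# (backslash escapes assumed) or a bare token running up to the next comma.
_FIELD = re.compile(r'\s*([^=,<>\s]+)=("(?:[^"\\]|\\.)*"|[^,]*)\s*(?:,|$)')


def parse_meta_line(line):
    """Return (key, value, kind) where kind is 'unstructured' or 'structured'.

    For structured lines, value is a dict of the fields between '<' and '>'.
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    key, sep, value = line[2:].rstrip("\r\n").partition("=")
    if not sep or not key or not value:
        raise ValueError("expected ##key=value with a non-empty value")
    if not value.startswith("<"):
        # An unstructured value must not start with '<', so anything else
        # is taken verbatim.
        return key, value, "unstructured"
    if not value.endswith(">"):
        raise ValueError("structured value must end with '>'")

    fields, pos, body = {}, 0, value[1:-1]
    while pos < len(body):
        m = _FIELD.match(body, pos)
        if m is None:
            raise ValueError(f"malformed field at offset {pos}")
        name, raw = m.group(1), m.group(2)
        if len(raw) >= 2 and raw[0] == raw[-1] == '"':
            # Strip the quotes and undo backslash escapes.
            raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
        fields[name] = raw
        pos = m.end()
    if "ID" not in fields:
        raise ValueError("structured line lacks the required ID field")
    return key, fields, "structured"
```

Under these assumptions, `##pedigreeDB=URL` is classified as unstructured, while an `##INFO` line with extra `Source` and `Version` fields yields a dictionary in which the extra values are plain strings.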
|
||
It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file. | ||
These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above). | ||
|
||
Meta-information lines can be in any order with the exception of `fileformat` which must come first. | ||
Meta-information lines are optional, but if they are present then they must be completely well-formed. | ||
Other than \verb|##fileformat|, they may appear in any order. | ||
Note that BCF, the binary counterpart of VCF, requires that all entries are present. | ||
It is recommended to include meta-information lines describing the entries used in the body of the VCF file. | ||
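Two of the rules stated in this subsection, that `##fileformat` comes first and that IDs are unique among lines sharing the same `##key=` prefix, lend themselves to a quick mechanical check. The following Python sketch is illustrative only: `check_meta_lines` is a hypothetical helper, and the ID extraction uses a simplified regular expression that ignores the possibility of `,ID=` appearing inside a quoted string.

```python
import re
from collections import defaultdict

# Matches ##key=<...> and captures the key and the text between < and >.
_STRUCTURED = re.compile(r'^##([^=]+)=<(.*)>$')
# Simplified: finds an ID field at the start of the body or after a comma.
_ID = re.compile(r'(?:^|,)ID=([^,]+)')


def check_meta_lines(lines):
    """Return a list of problem descriptions for a sequence of '##' lines."""
    problems = []
    seen = defaultdict(set)  # ##key -> IDs already used for that key
    for n, line in enumerate(lines, start=1):
        if line.startswith("##fileformat=") and n != 1:
            problems.append(f"line {n}: ##fileformat must be the first line")
        m = _STRUCTURED.match(line)
        if m is None:
            continue  # unstructured lines carry no ID
        kind, body = m.groups()
        id_match = _ID.search(body)
        if id_match is None:
            problems.append(f"line {n}: ##{kind} line has no ID")
            continue
        ident = id_match.group(1)
        if ident in seen[kind]:
            problems.append(f"line {n}: duplicate ID {ident!r} for ##{kind}")
        seen[kind].add(ident)
    return problems
```

Note that the same ID may legitimately appear under different keys (for example once in `##INFO` and once in `##FORMAT`), which is why the set of seen IDs is kept per key.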
|
||
|
||
\subsubsection{File format} | ||
|
@@ -266,6 +280,7 @@ \subsubsection{Pedigree field format} | |
|
||
|
||
\subsection{Header line syntax} | ||
\label{header-line} | ||
The header line names the 8 fixed, mandatory columns. These columns are as follows: | ||
\begin{center} | ||
\#CHROM | ||
|
@@ -1306,7 +1321,7 @@ \subsubsection{Clonal derivation relationships} | |
Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided: | ||
|
||
\begin{verbatim} | ||
##pedigreeDB=<url> | ||
##pedigreeDB=URL | ||
\end{verbatim} | ||
|
||
\begin{samepage} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.1 | ||
##CauseOfFailure=Non-valid URL | ||
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB> | ||
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.1 | ||
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db> | ||
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db> | ||
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db | ||
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.2 | ||
##CauseOfFailure=Non-valid URL | ||
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB> | ||
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.2 | ||
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db> | ||
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db> | ||
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db | ||
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.3 | ||
##CauseOfFailure=Non-valid URL | ||
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB> | ||
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
##fileformat=VCFv4.3 | ||
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db> | ||
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db> | ||
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db | ||
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db | ||
#CHROM POS ID REF ALT QUAL FILTER INFO | ||
1 123 . TC T . . . |
Which ones are you suggesting are required?
I was thinking of Number, Type, and Description for INFO
To my mind, ID+Number/Type/Description are the “default fields” for ##INFO. IMHO the problem here is that §1.4 starts talking about details of INFO etc that should be in §1.4.2 etc. Disentangling that (and revisiting the mysterious default/extra/optional fields terminology) I think should be a followup to this PR.
Fair enough.