From cf4e7ac981c4f5c4766143f4834bfe36921d60aa Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Tue, 4 Jan 2022 17:38:27 +0000
Subject: [PATCH 1/4] Clarify structured vs unstructured meta-information lines

Introduce the term "unstructured meta-information line", and reword
this section so it describes the two flavours of meta-information line
clearly. Specify that an unstructured value must not start with `<`
(so that structured/unstructured are easily distinguished) and must be
non-empty.

Remove `<>` from unstructured `##pedigreeDB` example. PR #88 removed
the `<>` from one instance of `##pedigreeDB=` presumably on the grounds
that they were merely metasyntactic variable notation and not intended
to appear literally, but missed this instance.
---
 VCFv4.3.tex | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/VCFv4.3.tex b/VCFv4.3.tex
index 58b64533c..c3ce009e1 100644
--- a/VCFv4.3.tex
+++ b/VCFv4.3.tex
@@ -100,24 +100,37 @@ \subsection{Data types}
 For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
 
 \subsection{Meta-information lines}
-File meta-information is included after the \#\# string and must be key=value pairs.
-Meta-information lines are optional, but if they are present then they must be completely well-formed.
-Note that BCF, the binary counterpart of VCF, requires that all entries are present.
-It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
+File meta-information lines start with ``\verb|##|'' and must appear first in the VCF file, before the header line (section~\ref{header-line}) and data record lines (section~\ref{data-lines}).
+They may be either \emph{unstructured} or \emph{structured}.
+
+An \emph{unstructured} meta-information line consists of a~\emph{key} (denoting the type of meta-information recorded) and a~\emph{value} (which may not be empty and must not start with a `\verb|<|' character), separated by an `\verb|=|' character:
+\begin{quote}
+\verb|##|\emph{key}\verb|=|\emph{value}
+\end{quote}
+Several unstructured meta-information lines are defined in this specification, notably \verb|##fileformat|.
+Others not defined by this specification, e.g.\ \verb|##fileDate| and \verb|##source|, are commonly found in VCF files.
+These typically have meanings that are obvious, or they are immaterial for processing the file, or both.
 
-All structured lines that have their value enclosed within ``$<>$'' require an ID which must be unique within their type.
-For all of the structured lines (\#\#INFO, \#\#FORMAT, \#\#FILTER, etc.), extra fields can be included after the default fields.
+A \emph{structured} meta-information line is similar, but the value is itself a comma-separated list of key=value pairs, enclosed within `\verb|<|' and `\verb|>|' characters:
+\begin{quote}
+\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
+\end{quote}
+All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
+For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.), extra fields can be included after the default fields.
 For example:
 \begin{verbatim}
-##INFO=
+##INFO=
 \end{verbatim}
 In the above example, the extra fields of ``Source'' and ``Version'' are provided.
-Optional fields must be stored as strings even for numeric values.
+The values of optional fields must be written as quoted strings, even for numeric values.
 
 It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file.
 These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).
 
-Meta-information lines can be in any order with the exception of `fileformat` which must come first.
+Meta-information lines are optional, but if they are present then they must be completely well-formed.
+Other than \verb|##fileformat|, they may appear in any order.
+Note that BCF, the binary counterpart of VCF, requires that all entries are present.
+It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
 
 \subsubsection{File format}
 
@@ -266,6 +279,7 @@ \subsubsection{Pedigree field format}
 
 
 \subsection{Header line syntax}
+\label{header-line}
 The header line names the 8 fixed, mandatory columns. These columns are as follows:
 \begin{center}
 \#CHROM
@@ -1306,7 +1320,7 @@ \subsubsection{Clonal derivation relationships}
 Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:
 
 \begin{verbatim}
-##pedigreeDB=
+##pedigreeDB=URL
 \end{verbatim}
 
 \begin{samepage}

From 6f8b89aef9c9b5a714f37ce11cc89c5676499d87 Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Wed, 5 Jan 2022 10:41:59 +0000
Subject: [PATCH 2/4] Remove `<>` from VCF test file `##pedigreeDB` lines

The specification now consistently reflects that `##pedigreeDB`'s value
should not be delimited by angle brackets (despite being an URL!).

Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause
of failure is the invalid URL hostname rather than the angle brackets.
---
 test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf | 2 +-
 test/vcf/4.1/passed/complexfile_passed_000.vcf     | 2 +-
 test/vcf/4.1/passed/passed_meta_pedigreedb.vcf     | 4 ++--
 test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf | 2 +-
 test/vcf/4.2/passed/complexfile_passed_000.vcf     | 2 +-
 test/vcf/4.2/passed/passed_meta_pedigreedb.vcf     | 4 ++--
 test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf | 2 +-
 test/vcf/4.3/passed/complexfile_passed_000.vcf     | 2 +-
 test/vcf/4.3/passed/passed_meta_pedigreedb.vcf     | 4 ++--
 9 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf b/test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf
index 2c6a7776e..a63dc8137 100644
--- a/test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf
+++ b/test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.1
 ##CauseOfFailure=Non-valid URL
-##pedigreeDB=
+##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .
diff --git a/test/vcf/4.1/passed/complexfile_passed_000.vcf b/test/vcf/4.1/passed/complexfile_passed_000.vcf
index 37dcd0a7f..e6e48476d 100644
--- a/test/vcf/4.1/passed/complexfile_passed_000.vcf
+++ b/test/vcf/4.1/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
 ##assembly=ftp://user@host:8080/path/to/file.fastq
 ##PEDIGREE=
 ##PEDIGREE=
-##pedigreeDB=
+##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
 ##contig=
 ##contig=
 ##contig=
diff --git a/test/vcf/4.1/passed/passed_meta_pedigreedb.vcf b/test/vcf/4.1/passed/passed_meta_pedigreedb.vcf
index 06457a33a..2e598dbb6 100644
--- a/test/vcf/4.1/passed/passed_meta_pedigreedb.vcf
+++ b/test/vcf/4.1/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.1
-##pedigreeDB=
-##pedigreeDB=
+##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
+##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .
diff --git a/test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf b/test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf
index 5a66f388e..0fdfcab2c 100644
--- a/test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf
+++ b/test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.2
 ##CauseOfFailure=Non-valid URL
-##pedigreeDB=
+##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .
diff --git a/test/vcf/4.2/passed/complexfile_passed_000.vcf b/test/vcf/4.2/passed/complexfile_passed_000.vcf
index c1b8b922a..f8abb80ff 100644
--- a/test/vcf/4.2/passed/complexfile_passed_000.vcf
+++ b/test/vcf/4.2/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
 ##assembly=ftp://user@host:8080/path/to/file.fastq
 ##PEDIGREE=
 ##PEDIGREE=
-##pedigreeDB=
+##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
 ##contig=
 ##contig=
 ##contig=
diff --git a/test/vcf/4.2/passed/passed_meta_pedigreedb.vcf b/test/vcf/4.2/passed/passed_meta_pedigreedb.vcf
index 1c34eefc9..4ec472c0b 100644
--- a/test/vcf/4.2/passed/passed_meta_pedigreedb.vcf
+++ b/test/vcf/4.2/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.2
-##pedigreeDB=
-##pedigreeDB=
+##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
+##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .
diff --git a/test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf b/test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf
index 9b051a7a0..1a98618cd 100644
--- a/test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf
+++ b/test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.3
 ##CauseOfFailure=Non-valid URL
-##pedigreeDB=
+##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .
diff --git a/test/vcf/4.3/passed/complexfile_passed_000.vcf b/test/vcf/4.3/passed/complexfile_passed_000.vcf
index 467d846eb..df974d5fe 100644
--- a/test/vcf/4.3/passed/complexfile_passed_000.vcf
+++ b/test/vcf/4.3/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
 ##assembly=ftp://user@host:8080/path/to/file.fastq
 ##PEDIGREE=
 ##PEDIGREE=
-##pedigreeDB=
+##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
 ##contig=
 ##contig=
 ##contig=
diff --git a/test/vcf/4.3/passed/passed_meta_pedigreedb.vcf b/test/vcf/4.3/passed/passed_meta_pedigreedb.vcf
index 0beb16fa1..cad8c6048 100644
--- a/test/vcf/4.3/passed/passed_meta_pedigreedb.vcf
+++ b/test/vcf/4.3/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.3
-##pedigreeDB=
-##pedigreeDB=
+##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
+##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
 #CHROM POS ID REF ALT QUAL FILTER INFO
 1 123 . TC T . . .

From a8e8c2ae9772847fbbda4101d45739171c6f0fad Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Thu, 6 Jan 2022 12:38:27 +0000
Subject: [PATCH 3/4] Fix misleading `##pedigreeDB=` notation in older VCF specifications

PR #88 removed the `<>` from `##pedigreeDB=` in VCFv4.3.tex, presumably
on the grounds that they were merely metasyntactic variable notation
and not intended to appear literally. As some readers still refer to
these older documents, remove the misleading notation here too.
---
 VCFv4.1.tex | 4 ++--
 VCFv4.2.tex | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/VCFv4.1.tex b/VCFv4.1.tex
index d791d3937..593298c46 100644
--- a/VCFv4.1.tex
+++ b/VCFv4.1.tex
@@ -136,7 +136,7 @@ \subsubsection{Pedigree field format}
 \end{verbatim}
 or a link to a database:
 \begin{verbatim}
-##pedigreeDB=
+##pedigreeDB=URL
 \end{verbatim}
 \subsection{Header line syntax}
 The header line names the 8 fixed, mandatory columns. These columns are as follows:
@@ -893,7 +893,7 @@ \subsubsection{Clonal derivation relationships}
 Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:
 
 \begin{verbatim}
-##pedigreeDB=
+##pedigreeDB=URL
 \end{verbatim}
 
 The most general form of a pedigree line is:
diff --git a/VCFv4.2.tex b/VCFv4.2.tex
index ca8e2a998..28f2b3d50 100644
--- a/VCFv4.2.tex
+++ b/VCFv4.2.tex
@@ -153,7 +153,7 @@ \subsubsection{Pedigree field format}
 \end{verbatim}
 or a link to a database:
 \begin{verbatim}
-##pedigreeDB=
+##pedigreeDB=URL
 \end{verbatim}
 \subsection{Header line syntax}
 The header line names the 8 fixed, mandatory columns. These columns are as follows:
@@ -910,7 +910,7 @@ \subsubsection{Clonal derivation relationships}
 Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:
 
 \begin{verbatim}
-##pedigreeDB=
+##pedigreeDB=URL
 \end{verbatim}
 
 The most general form of a pedigree line is:

From 2da6d4eed458375c5402df7e33b041425b90eabe Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Tue, 1 Feb 2022 00:52:36 +0000
Subject: [PATCH 4/4] Mention structured lines that are not defined by the VCF specification

---
 VCFv4.3.tex | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/VCFv4.3.tex b/VCFv4.3.tex
index c3ce009e1..449319763 100644
--- a/VCFv4.3.tex
+++ b/VCFv4.3.tex
@@ -116,13 +116,14 @@ \subsection{Meta-information lines}
 \verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
 \end{quote}
 All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
-For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.), extra fields can be included after the default fields.
+For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields.
 For example:
 \begin{verbatim}
 ##INFO=
 \end{verbatim}
 In the above example, the extra fields of ``Source'' and ``Version'' are provided.
 The values of optional fields must be written as quoted strings, even for numeric values.
+Other structured lines not defined by this specification may also be used; the only default field for such lines is the required \verb|ID| field.
 
 It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file.
 These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).
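
Reviewer note: the rules this series introduces for VCFv4.3 (an unstructured value is non-empty and does not start with `<`; a structured value is a `<...>`-enclosed list of key=value pairs with an ID that is unique per `##key=` prefix) can be sketched as a small checker. This is only an illustration, not part of the patches or of any existing validator. The names `classify_meta_line`, `split_fields` and `duplicate_ids` are hypothetical, and the quoted-string handling (commas inside double quotes do not split fields, a backslash inside quotes escapes the next character) is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; not taken from the spec or a library.
META_RE = re.compile(r"^##([^=]+)=(.*)$")


def classify_meta_line(line):
    """Return (key, kind, value), where kind is 'structured' or 'unstructured'."""
    match = META_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a ##key=value line: {line!r}")
    key, value = match.groups()
    if not value:
        raise ValueError(f"empty value for ##{key}")
    if not value.startswith("<"):
        return key, "unstructured", value
    if not value.endswith(">"):
        raise ValueError(f"structured value for ##{key} lacks closing '>'")
    return key, "structured", value


def split_fields(value):
    """Split '<k=v,k="a,b">' on commas that fall outside double quotes."""
    parts, buf, in_quotes, escaped = [], [], False, False
    for ch in value[1:-1]:
        if escaped:
            escaped = False
        elif in_quotes and ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            parts.append("".join(buf))
            buf = []
            continue
        buf.append(ch)
    parts.append("".join(buf))
    fields = {}
    for part in parts:
        name, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"field without '=': {part!r}")
        fields[name] = val  # surrounding quotes, if any, are kept as written
    return fields


def duplicate_ids(lines):
    """Return the (key, ID) pairs used by more than one structured line."""
    seen = Counter()
    for line in lines:
        key, kind, value = classify_meta_line(line)
        if kind == "structured":
            fields = split_fields(value)
            if "ID" not in fields:
                raise ValueError(f"##{key} line lacks the required ID field")
            seen[(key, fields["ID"])] += 1
    return [pair for pair, count in seen.items() if count > 1]
```

Under these assumptions the adjusted `##pedigreeDB=ftp://...` test lines classify as unstructured, and an ID is only reported as a duplicate when it repeats under the same `##key=` prefix, so an INFO and a FORMAT line may share an ID.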