This project contains a plugin for integrating Nextflow pipelines with IRIDA Next. In particular, it will enable a pipeline to produce output consistent with the IRIDA Next pipeline standards.
The following is the minimal configuration needed for this plugin.
nextflow.config
plugins {
id 'nf-iridanext'
}
iridanext {
enabled = true
output {
path = "${params.outdir}/iridanext.output.json.gz"
}
}
When run with a pipeline (e.g., the IRIDA Next Example pipeline), this configuration will produce the following JSON output in the file ${params.outdir}/iridanext.output.json.gz.
Note: ${params.outdir} is optional and is used when the file should be written to the output directory specified by --outdir.
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {}
}
}
This file conforms to the standards as defined in the IRIDA Next Pipeline Standards document.
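As a usage sketch (the pipeline name, profile, and parameters below are illustrative assumptions, not requirements of the plugin), the configuration can be supplied to any Nextflow run with the -c option:
# Illustrative only: substitute your own pipeline and its required parameters
nextflow run phac-nml/iridanextexample -profile docker --input samplesheet.csv --outdir results -c nextflow.config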
To include files to be saved within IRIDA Next, you can define path match expressions under the iridanext.output.files section. The global section is used for global output files of the pipeline, while the samples section is used for output files associated with particular samples (matching to sample identifiers is performed automatically).
nextflow.config
plugins {
id 'nf-iridanext'
}
iridanext {
enabled = true
output {
path = "${params.outdir}/iridanext.output.json.gz"
overwrite = true
files {
global = ["**/summary/summary.txt.gz"]
samples = ["**/assembly/*.assembly.fa.gz"]
}
}
}
This configuration will produce the following example JSON output:
iridanext.output.json.gz
{
"files": {
"global": [{ "path": "summary/summary.txt.gz" }],
"samples": {
"SAMPLE1": [{ "path": "assembly/SAMPLE1.assembly.fa.gz" }],
"SAMPLE3": [{ "path": "assembly/SAMPLE3.assembly.fa.gz" }],
"SAMPLE2": [{ "path": "assembly/SAMPLE2.assembly.fa.gz" }]
}
},
"metadata": {
"samples": {}
}
}
Files are matched to samples using the meta.id map used by nf-core formatted modules. The matching key (id in meta.id) can be overridden by setting:
iridanext.output.files.idkey = "newkey"
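For example, a minimal configuration sketch, assuming a pipeline whose meta map stores the sample identifier under a key named sample (an illustrative name):
iridanext {
    enabled = true
    output {
        path = "${params.outdir}/iridanext.output.json.gz"
        files {
            idkey = "sample"
            samples = ["**/assembly/*.assembly.fa.gz"]
        }
    }
}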
Metadata associated with samples can be included by filling in the iridanext.output.metadata.samples section, as below:
nextflow.config
plugins {
id 'nf-iridanext'
}
iridanext {
enabled = true
output {
path = "${params.outdir}/iridanext.output.json.gz"
overwrite = true
metadata {
samples {
csv {
path = "**/output.csv"
idcol = "column1"
}
}
}
}
}
This will parse a CSV file for metadata. The csv.path keyword specifies the file to parse, and csv.idcol defines the column whose values should match the sample identifiers.
If there exists an example CSV file like the following:
output.csv
| column1 | b | c |
|---------|---|---|
| SAMPLE1 | 2 | 3 |
| SAMPLE2 | 4 | 5 |
| SAMPLE3 | 6 | 7 |
Then running the pipeline will produce an output like the following:
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": { "b": "2", "c": "3" },
"SAMPLE2": { "b": "4", "c": "5" },
"SAMPLE3": { "b": "6", "c": "7" }
}
}
}
The CSV parser will only include metadata in the final output JSON for sample identifiers in the CSV file (defined in the column specified by csv.idcol) that match to sample identifiers in the pipeline meta map (the key in the meta map defined using iridanext.output.files.idkey).
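For instance (a hypothetical scenario), if output.csv contained an additional row for a sample the pipeline never processed:
column1,b,c
SAMPLE1,2,3
SAMPLE4,8,9
then only the SAMPLE1 metadata would be included in the output JSON; the SAMPLE4 row has no matching sample identifier and so is not included.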
If, instead of parsing a CSV file, you wish to parse metadata from a JSON file, then you can replace the csv {} configuration section above with:
json {
path = "**/output.json"
}
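For reference, a complete configuration sketch using the JSON parser (combining the fragments above; the **/output.json pattern is just an example) would look like:
nextflow.config
plugins {
    id 'nf-iridanext'
}
iridanext {
    enabled = true
    output {
        path = "${params.outdir}/iridanext.output.json.gz"
        overwrite = true
        metadata {
            samples {
                json {
                    path = "**/output.json"
                }
            }
        }
    }
}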
For example, a JSON file like the following:
output.json
{
"SAMPLE1": {
"key1": "value1",
"key2": ["a", "b"]
},
"SAMPLE2": {
"key1": "value2"
}
}
Would result in the following output:
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": { "key1": "value1" },
"SAMPLE2": { "key2": ["a", "b"] }
}
}
}
Setting the configuration value iridanext.output.metadata.samples.flatten = true will flatten the metadata JSON to a single level of key/value pairs (using dot . notation for keys).
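For example, a sketch placing flatten alongside the JSON parser configuration from above:
iridanext {
    enabled = true
    output {
        path = "${params.outdir}/iridanext.output.json.gz"
        metadata {
            samples {
                flatten = true
                json {
                    path = "**/output.json"
                }
            }
        }
    }
}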
The two examples below show the difference between flatten = false (the default) and flatten = true.
With flatten = false (default):
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": {
"key1": {
"subkey1": "value1",
"subkey2": "value2"
}
},
"SAMPLE2": {
"key2": ["a", "b"]
}
}
}
}
With flatten = true:
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": {
"key1.subkey1": "value1",
"key1.subkey2": "value2"
},
"SAMPLE2": {
"key2.1": "a",
"key2.2": "b"
}
}
}
}
The iridanext.output.metadata.samples.{ignore,keep,rename} configuration options can be used to adjust what is stored within the metadata JSON structure.
Note: If flatten = true is enabled, then the metadata key names here refer to the flattened names (for example, ignore = ["key1.subkey1"] rather than ignore = ["key1"]).
Setting iridanext.output.metadata.samples.ignore = ["b"] will cause metadata with the key b to be ignored in the final IRIDA Next output JSON file. For example, given the config below:
nextflow.config
plugins {
id 'nf-iridanext'
}
iridanext {
enabled = true
output {
path = "${params.outdir}/iridanext.output.json.gz"
overwrite = true
metadata {
samples {
ignore = ["b"]
csv {
path = "**/output.csv"
idcol = "column1"
}
}
}
}
}
If this config is used to load the CSV file below:
output.csv
| column1 | b | c |
|---------|---|---|
| SAMPLE1 | 2 | 3 |
| SAMPLE2 | 4 | 5 |
| SAMPLE3 | 6 | 7 |
Then an output like below is produced (that is, the b column is ignored).
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": { "c": "3" },
"SAMPLE2": { "c": "5" },
"SAMPLE3": { "c": "7" }
}
}
}
Setting iridanext.output.metadata.samples.keep = ["b"] is similar to the ignore case, except only the listed keys will be kept.
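For example, a configuration sketch mirroring the ignore example but keeping only the b column:
iridanext {
    enabled = true
    output {
        path = "${params.outdir}/iridanext.output.json.gz"
        overwrite = true
        metadata {
            samples {
                keep = ["b"]
                csv {
                    path = "**/output.csv"
                    idcol = "column1"
                }
            }
        }
    }
}
Loading the same output.csv as above then produces: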
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": { "b": "2" },
"SAMPLE2": { "b": "4" },
"SAMPLE3": { "b": "6" }
}
}
}
Setting iridanext.output.metadata.samples.rename will rename the listed keys to new key names (specified as a Map). For example:
nextflow.config
plugins {
id 'nf-iridanext'
}
iridanext {
enabled = true
output {
path = "${params.outdir}/iridanext.output.json.gz"
overwrite = true
metadata {
samples {
rename = ["b": "b_col"]
csv {
path = "**/output.csv"
idcol = "column1"
}
}
}
}
}
iridanext.output.json.gz
{
"files": {
"global": [],
"samples": {}
},
"metadata": {
"samples": {
"SAMPLE1": { "b_col": "2", "c": "3" },
"SAMPLE2": { "b_col": "4", "c": "5" },
"SAMPLE3": { "b_col": "6", "c": "7" }
}
}
}
There are two different scenarios where metadata key/value pairs could be missing for a sample, which result in different behaviours in IRIDA Next.
- Ignore key: If the key is left out of the samples metadata in the IRIDA Next JSON, then nothing is written for that key for the sample. Any existing metadata under that key will remain in IRIDA Next.
- Delete key: If a metadata value is an empty string ("key": "") or null ("key": null), then IRIDA Next will remove that particular metadata key/value pair from the sample metadata if it exists. This is the expected scenario if pipeline results contain missing (or N/A) values (deleting older metadata keys prevents mixing up old and new pipeline analysis results in the metadata table).
The following are the expectations for writing missing values in the final IRIDA Next JSON file (in order to delete the key/value pairs in IRIDA Next).
If the metadata key b for SAMPLE1 is encoded as an empty string "" or null in the JSON file, like the below example:
output.json
{
"SAMPLE1": {
"a": "value1",
"b": ""
}
}
Then the final IRIDA Next JSON file will preserve the empty string/null value in the samples metadata section:
iridanext.output.json.gz
"metadata": {
"samples": {
"SAMPLE1": { "a": "value1", "b": "" }
}
}
If the metadata key b for SAMPLE1 is left empty in the CSV file, like the below two examples:
output.csv as table
| column1 | b | c |
|---------|---|---|
| SAMPLE1 |   | 3 |
| SAMPLE2 | 4 | 5 |
| SAMPLE3 | 6 | 7 |
output.csv as CSV
column1,b,c
SAMPLE1,,3
SAMPLE2,4,5
SAMPLE3,6,7
Then the value for b for SAMPLE1 will be written as an empty string in the IRIDA Next JSON file:
iridanext.output.json.gz
"metadata": {
"samples": {
"SAMPLE1": { "b": "", "c": "3" },
"SAMPLE2": { "b": "4", "c": "5" },
"SAMPLE3": { "b": "6", "c": "7" }
}
}
In order to build this plugin you will need a Java Development Kit (such as OpenJDK) and Groovy. For Ubuntu, this can be installed with:
sudo apt install default-jdk groovy
In order to build and install the plugin from source, please do the following:
git clone https://github.com/phac-nml/nf-iridanext.git
cd nf-iridanext
make buildPlugins
Please see the Nextflow plugins documentation and the nf-hello example plugin for more details.
cp -r build/plugins/nf-iridanext-0.2.0 ~/.nextflow/plugins
This copies the compiled plugin files into the Nextflow plugin cache (default ~/.nextflow/plugins). Please change the version 0.2.0 to the version of the plugin built from source.
In order to use the built plugin, you have to specify the exact version in the Nextflow configuration so that Nextflow does not try to update the plugin. That is, in the configuration use:
plugins {
id '[email protected]'
}
In order to run the test cases, please clone this repository and run the following command:
./gradlew check
To get more information for any failed tests, please run:
./gradlew check --info
One use case of this plugin is to structure reads and metadata downloaded from NCBI/ENA for storage in IRIDA Next by making use of the nf-core/fetchngs pipeline. The example configuration fetchngs.conf can be used for this purpose. To test, please run the following (using ids.csv as example data accessions):
# Download config and SRA accessions
wget https://raw.githubusercontent.com/phac-nml/nf-iridanext/main/docs/examples/fetchngs/fetchngs.conf
wget https://raw.githubusercontent.com/phac-nml/nf-iridanext/main/docs/examples/fetchngs/ids.csv
nextflow run nf-core/fetchngs -profile singularity --outdir results --input ids.csv -c fetchngs.conf
This will produce the following output: iridanext.output.json.
This plugin was developed based on the nf-hello Nextflow plugin template (https://github.com/nextflow-io/nf-hello). Other sources of information for development include the nf-prov and nf-validation Nextflow plugins, as well as the Nextflow documentation.
Copyright 2023 Government of Canada
Original nf-hello project Copyright to respective authors
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.