telatin/getreads


getreads [unsupported]

Note

I recommend using nf-core/fetchngs instead. This repository is no longer supported as of 2023.

A minimal pipeline to download FASTQ files from SRA given a list of accession IDs.

🪄 Usage

See installation for more details

# Suggestion: replace main with a version from the releases
nextflow run telatin/getreads -r main -profile docker \
    --list list.txt --outdir downloaded-reads/

Where:

  • --list "list.txt" is a list of SRA accession IDs in simple text format

  • --outdir "name" is the name of the output directory

  • --wait INT is the number of seconds to wait after running ffq [default: 2]

  • -profile docker will use Docker for dependencies. An easy alternative is to create a conda environment from deps/env.yaml. Singularity is supported but untested (clusters with Singularity are usually offline anyway)
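
As a minimal sketch of what the input for --list might look like, the snippet below writes a three-line list.txt and checks its shape. The one-accession-per-line layout, the example accessions, and the SRR/ERR/DRR pattern are assumptions made for illustration; the pipeline itself may accept other accession types.

```shell
# Write three example run accessions, one per line (assumed layout).
cat > list.txt <<'EOF'
SRR000001
SRR000002
SRR000003
EOF

# Count lines that do not look like an SRR/ERR/DRR run accession.
# The pattern is an assumption for this sketch, not a rule enforced by the pipeline.
bad_lines=$(grep -Evc '^(SRR|ERR|DRR)[0-9]+$' list.txt || true)
echo "lines not matching the assumed pattern: ${bad_lines}"
```

A file like this can then be passed to the command above as --list list.txt.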

📂 Output

The output directory contains:

  • 📁 json (JSON file, one for each accession)
  • 📁 urls (text files with the download URIs)
  • 📁 reads (FASTQ.gz files, a set per accession)
  • 🗒️ stats.txt (reads statistics)
  • 🗒️ check.txt (a report of the number of files downloaded per ID, with a check that all files of an ID contain the same number of reads)
  • 🗒️ table.tsv (metadata table built from the JSON files, only for samples where ffq didn't fail; new in 2.0)
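
The kind of consistency check summarised in check.txt can be sketched as follows. This is not the pipeline's own code (which collects stats with seqfu); it swaps in plain gzip and wc, assumes four lines per FASTQ record, and uses hypothetical function names.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per record.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Print "OK" if every file given has the same read count, "MISMATCH" otherwise.
same_read_count() {
    first=""
    for f in "$@"; do
        n=$(count_reads "$f")
        if [ -z "$first" ]; then
            first=$n
        elif [ "$n" -ne "$first" ]; then
            echo "MISMATCH"
            return 1
        fi
    done
    echo "OK"
}
```

For a paired-end accession, one would call same_read_count with both files of the pair; the exact file names under reads/ depend on what the pipeline downloaded.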

Alternatives

nf-core/fetchngs ⭐ is a fully-featured pipeline to download reads and associated metadata. It's a fantastic and regularly updated tool. Because it sometimes failed for me for reasons related to its complexity, I made this minimal pipeline as a backup plan.

Uses

  • ffq to fetch URLs given the accessions, wrapped in ffq-sake.py, which retries if NCBI responds with "too many requests" but fails gracefully on a 400 error.
  • wget to download the reads
  • seqfu to collect stats
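
The retry behaviour described for ffq-sake.py can be illustrated with a small shell wrapper. This is a sketch of the general idea only, not the actual script: the function name, the MAX_TRIES and WAIT_SECONDS variables, and the assumption that the rate-limit message appears on stderr are all hypothetical.

```shell
# Hypothetical knobs for this sketch; neither is an option of the pipeline.
MAX_TRIES=${MAX_TRIES:-3}
WAIT_SECONDS=${WAIT_SECONDS:-2}

# Run a command, retrying only when its stderr mentions a rate limit.
retry_on_rate_limit() {
    attempt=1
    while :; do
        err_file=$(mktemp)
        if "$@" 2> "$err_file"; then
            rm -f "$err_file"
            return 0
        fi
        if grep -qi 'too many requests' "$err_file" && [ "$attempt" -lt "$MAX_TRIES" ]; then
            # Rate limited: wait, then try again.
            rm -f "$err_file"
            attempt=$((attempt + 1))
            sleep "$WAIT_SECONDS"
            continue
        fi
        # Any other failure (for example a 400 error): report it and give up.
        cat "$err_file" >&2
        rm -f "$err_file"
        return 1
    done
}
```

A call could look like retry_on_rate_limit ffq SRR000001, with the command's standard output redirected to a JSON file. Retrying only on the rate-limit message keeps a bad accession from being requested again and again.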


Cite

If you use this pipeline, please cite:

  • Gálvez-Merchán, Á., et al. (2023). Metadata retrieval from sequence databases with ffq. Bioinformatics.
  • Telatin, A., et al. (2020). SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering.