Computational pipeline for calling consensi on R2C2 nanopore data

KamilMaliszArdigen/C3POa

 
 

C3POa

C3POa (Concatemeric Consensus Caller with Partial Order alignments) is a computational pipeline for calling consensi on R2C2 nanopore data.

This version of C3POa uses a new aligner (gonk). The old version of C3POa that uses water can be found here.

Dependencies

To fetch and build dependencies, use setup.sh.
setup.sh will download and build the packages that you need to run C3POa (except Python, NumPy, Go, and blat).
You don't need to have these in your PATH, but if you don't, you'll need to use a config file. The setup script does not install programs or add them to your PATH, so even if you use it, you still need to put the paths to the built programs into a config file.

chmod +x setup.sh
./setup.sh

To install NumPy, you can go here.
Otherwise, you can install it with your computer's package manager (apt-get, dnf, brew, etc.).
Pip3 is another option for NumPy installation.
Example:

sudo dnf install python3-numpy

or

pip3 install numpy

For blat, there are a couple of options: you can build it from source or download an executable. Please follow the documentation in the blat readme for make instructions.


Usage

After resolving all of the dependencies, you can run C3POa with python.

C3POa_preprocessing.py

Takes raw 1D nanopore R2C2 reads in fastq format, removes low-quality and short reads, and then finds splint sequences in the remaining reads using BLAT. It then adds the position of the splint to the read name. Preprocessing also demultiplexes reads based on the splints listed in the splint fasta file. You should end up with a directory that looks like somePath/Splint_1/R2C2_raw_reads.fastq, where Splint_1 is the splint that the reads in that file were assigned to.

Options (All required):

  -i  raw reads in fastq format

  -o  path where a splint_reads folder with the output files will be generated

  -q  only reads above this average quality will be retained (9 is recommended)

  -l  only reads longer than this number will be retained (1000 recommended)

  -s  sequence of DNA splint used in R2C2 protocol in fasta format

  -c  config file containing path to BLAT binary

python3 C3POa_preprocessing.py -i raw_reads.fastq -o output_path -q quality_cutoff \
                               -l read_length_cutoff -s Splint_sequence.fasta
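The quality and length filter described by -q and -l can be illustrated with a short sketch. This is not the code in C3POa_preprocessing.py: the function names are hypothetical, and it assumes that "average quality" means the arithmetic mean of Phred+33 scores and that both cutoffs are strict ("above", "longer than"), following the wording of the options.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # skip stray blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_phred(qual):
    """Mean Phred score, assuming Phred+33 encoding (an assumption of this sketch)."""
    if not qual:
        return 0.0
    return mean(ord(char) - 33 for char in qual)


def passes_filters(seq, qual, quality_cutoff, length_cutoff):
    """Keep a read only if it is strictly above both cutoffs."""
    return len(seq) > length_cutoff and mean_phred(qual) > quality_cutoff


if __name__ == "__main__":
    demo = io.StringIO("@read_a\nACGTACGT\n+\nIIIIIIII\n@read_b\nACGT\n+\n!!!!\n")
    for name, seq, qual in read_fastq(demo):
        # Tiny cutoffs so the demo records are meaningful; real runs use e.g. 9 and 1000.
        print(name, passes_filters(seq, qual, quality_cutoff=9, length_cutoff=3))
```

With the recommended values, a read would need a mean quality above 9 and a length above 1000 to be retained under these assumptions.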

Example input read:

@63f115bc-6a91-42bd-a78a-667fd8255069
ACAGTCGATCATAGCTTAGCATGCATCGACGATCGATCGATCGA
+
"01&%"."I;"CSA"qr{X"uvc"\n"ggZ"Swj"yq"{wD"{z

Example output read (here 10 is the splint position appended to the read name):

@63f115bc-6a91-42bd-a78a-667fd8255069_10
ACAGTCGATCATAGCTTAGCATGCATCGACGATCGATCGATCGA
+
"01&%"."I;"CSA"qr{X"uvc"\n"ggZ"Swj"yq"{wD"{z
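For downstream scripts that need the splint position back, the read name in the example above can be split on its last underscore. The helper below is a hypothetical sketch, not part of C3POa, and assumes the position is always the final underscore-separated field and is an integer.

```python
def split_splint_position(read_name):
    """Split 'readName_position' into (readName, position).

    Assumes the splint position is the last '_'-separated field.
    A leading '@' from the FASTQ header line is dropped if present.
    Raises ValueError if the name carries no integer suffix.
    """
    base, _, position = read_name.lstrip("@").rpartition("_")
    return base, int(position)


if __name__ == "__main__":
    name, pos = split_splint_position("@63f115bc-6a91-42bd-a78a-667fd8255069_10")
    print(name, pos)
```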

C3POa.py

Takes fastq output produced by C3POa_preprocessing.py and generates consensus sequences in fasta format and subread sequences in fastq format. C3POa now natively supports multiprocessing.

Options:

  -p  directory to which all temporary files will be written. Also where your final consensus
      file will end up. Defaults to your current directory

  -m  path to NUC.4.4.mat file (included in repository)

  -l  raw sequence length cutoff. Defaults to 1000

  -d  median distance between peaks cutoff. This should be the length of your shortest
      input sequence in your library preparation. Defaults to 500

  -c  config file containing paths to poa, racon, gonk, blat, and minimap2

  -z  use to exclude zero repeat reads

  -r  fastq file that contains reads generated by C3POa_preprocessing.py

  -t  use to print how long each dependency takes to run

  -n  the number of threads to use in multiprocessing. Defaults to 1

  -g  the number of reads processed by each thread. Defaults to 1000

  -s  the name of the sample. Defaults to R2C2

python3 C3POa.py -r preprocessed_reads.fastq -p outpath -m path/to/NUC.4.4.mat -l 1000 \
                 -n 8 -g 1000 -d 500 -c /path/to/config_file -o /path/to/consensus.fasta
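To make the relationship between -n and -g concrete, here is a minimal sketch of splitting reads into fixed-size groups that could then be handed to worker processes. How C3POa.py actually dispatches work is not shown in this README; chunk_reads is a hypothetical helper used only for illustration.

```python
from itertools import islice


def chunk_reads(reads, chunk_size=1000):
    """Yield successive lists of at most chunk_size reads (the role played by -g)."""
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # 2500 placeholder reads split into groups of 1000.
    sizes = [len(chunk) for chunk in chunk_reads(range(2500), chunk_size=1000)]
    print(sizes)
```

Under this sketch, each group could be passed to something like multiprocessing.Pool(processes=n).map, with n taken from -n. Larger groups mean fewer hand-offs between processes, while smaller groups spread uneven workloads more evenly.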

Timing things and excluding zero repeat reads:

python3 C3POa.py -t -z -r preprocessed_reads.fastq -p outpath -m path/to/NUC.4.4.mat -l 1000 \
                 -d 500 -c /path/to/config_file -o /path/to/consensus.fasta

When you include -t (--timer), gonk, poa, racon, and consensus.py will be timed (times are directed to stdout).

When you include -z (--zero), C3POa will exclude zero repeat reads.

Example output read (readName_averageQuality_originalReadLength_numberOfRepeats_subreadLength):

>efbfbf09-7e2b-48e6-8e57-b3d36886739c_46.53_5798_2_1844
ACAGTCGATCATAGCTTAGCATGCATCGACGATCGATCGATCGA...
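The header layout given above (readName_averageQuality_originalReadLength_numberOfRepeats_subreadLength) can be unpacked by splitting from the right, which keeps any underscores inside the read name intact. The ConsensusHeader type and parse_consensus_header function are hypothetical names for this sketch and are not part of C3POa.

```python
from typing import NamedTuple


class ConsensusHeader(NamedTuple):
    read_name: str
    average_quality: float
    original_read_length: int
    number_of_repeats: int
    subread_length: int


def parse_consensus_header(header):
    """Parse a consensus FASTA header into its five documented fields."""
    # Split only on the last four underscores so the read name stays whole.
    name, quality, read_length, repeats, subread_length = header.lstrip(">").rsplit("_", 4)
    return ConsensusHeader(name, float(quality), int(read_length), int(repeats), int(subread_length))


if __name__ == "__main__":
    print(parse_consensus_header(">efbfbf09-7e2b-48e6-8e57-b3d36886739c_46.53_5798_2_1844"))
```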

C3POa_postprocessing.py

Trims and reorients consensus sequences generated by C3POa.py to the 5'->3' direction.

Options:

  -i  fasta file containing consensus sequences generated by C3POa.py

  -o  directory which output files will be written to

  -a  cDNA adapter sequences in fasta format. Sequence names must be
      3Prime_adapter and 5Prime_adapter

  -c  config file containing path to BLAT binary

  -u  use to ignore read directionality

  -t  use to trim adapters off of the ends of the sequences

python3 C3POa_postprocessing.py -i /path/to/consensus.fasta -o out_path \
                                -c /path/to/config_file -a /path/to/adapter.fasta
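Reorienting a sequence to the 5'->3' direction comes down to reverse-complementing reads that were found on the opposite strand. The sketch below shows only that operation. How C3POa_postprocessing.py decides which reads are reversed (presumably from the BLAT adapter alignments) is not covered by this README, so the is_reverse flag is an assumed input and both function names are hypothetical.

```python
# Translation table for DNA bases; N is kept as N, and case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq, is_reverse):
    """Return seq in 5'->3' orientation, given whether it was read in reverse."""
    return reverse_complement(seq) if is_reverse else seq


if __name__ == "__main__":
    print(reorient("ACAGTC", is_reverse=True))
```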
