PacBio-utilities

A collection of scripts useful for working with PacBio-based assemblies.

The pacbio-util script requires BioPerl and samtools >= 0.1.19.

Correction of small indels in polished assemblies

We have noticed that PacBio assemblies can show single-base deletions, in most cases, in proximity to even fairly short homopolymer runs. This does not occur at all homopolymer runs, and in our limited samples incidence rate is perhaps 1 in 10kbp. Single-base insertions and longer insertions may occur but are much rarer.

There are three steps to applying this correction. First, map a set of Illumina reads to the PacBio assembly to generate sorted BAM file(s). Reads from multiple BAMs will be combined. These reads should be from the same individual/strain from which the assembly was created. Reads from other sequencing technologies, or from another individual/strain, will reduce the effectiveness of the correction and may introduce further errors. In all subsequent steps it is required that the order of the sequences in the assembly and the BAMs is identical.

Second, use the pacbio-util script here to generate a set of indel targets for correction. The command is pacbio-util indel-targets, and the criteria for determining targets may be modified via options. Use the -h option to see available options and current defaults.

The third and final step is to use the command pacbio-util indel-apply to apply the targets generated in the second step to the assembly to create a corrected assembly. Options may also be used here to restrict the set of targets applied; use the -h option to see available options and current defaults.

For an assembly pacbio_assembly.fasta, and mappings of Illumina paired-end and single-end reads to this assembly in BAM files pe.sorted.bam and se.sorted.bam, the following will generated the list of indel targets to targets.txt and a corrected assembly to corrected_assembly.fasta.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam > targets.txt
pacbio-util indel-apply -f pacbio_assembly.fasta -t targets.txt > corrected_assembly.fasta

The indel-apply command can find the assembly used for determining the targets by using the first line of the list of targets. This provides a useful shortcut for combining both commands into a one-line piped command. The tee is in place to save the list of targets as they pass through to indel-apply.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam \
| tee targets.txt \
| pacbio-util indel-apply > corrected_assembly.fasta

Here is example tool output (with -v) and scaffold size results before and after correction.

$ pacbio-util indel-targets -v -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam | tee targets.txt | pacbio-util indel-apply -v > corrected_assembly.fasta
samtools pipe command: 'samtools mpileup -s -B -d 10000 -L 10000 --ff 1280 -q 1 -Q 13 -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam |'
[mpileup] 2 samples in 2 input files
Merging columns from 2 BAM files ...
pacbio-util indel-targets: 1281 targets generated for assembly pacbio_assembly.fasta
pacbio-util indel-apply: Opening target assembly 'pacbio_assembly.fasta'
pacbio-util indel-apply: unitig_7, no targets
pacbio-util indel-apply: unitig_52, no targets
pacbio-util indel-apply: unitig_46, no targets
pacbio-util indel-apply: unitig_14, no targets
pacbio-util indel-apply: unitig_54, no targets
pacbio-util indel-apply: unitig_44, no targets
pacbio-util indel-apply: unitig_13, no targets
pacbio-util indel-apply: unitig_56, no targets
pacbio-util indel-apply: unitig_55, no targets
pacbio-util indel-apply: unitig_43, no targets
pacbio-util indel-apply: unitig_48, no targets
pacbio-util indel-apply: unitig_45, no targets
pacbio-util indel-apply: unitig_15, no targets
pacbio-util indel-apply: unitig_47, no targets
pacbio-util indel-apply: unitig_60, no targets
pacbio-util indel-apply: unitig_61, no targets
pacbio-util indel-apply: unitig_51, no targets
pacbio-util indel-apply: unitig_59, no targets
pacbio-util indel-apply: unitig_4, no targets
pacbio-util indel-apply: 1281 targets applied to assembly pacbio_assembly.fasta
$

Scaffold	bp before	bp after
unitig_1	6250549	6250752
unitig_10	1811588	1811655
unitig_7	15299	15299
unitig_52	7932	7932
unitig_46	8606	8606
unitig_14	1884	1884
unitig_0	10120325	10120600
unitig_8	3934972	3935082
unitig_12	240123	240143
unitig_3	16252	16253
unitig_54	22065	22065
unitig_44	12520	12520
unitig_53	5951772	5951911
unitig_13	86224	86224
unitig_56	30475	30475
unitig_55	15275	15275
unitig_41	19231	19232
unitig_43	9344	9344
unitig_58	6608738	6608943
unitig_9	2363594	2363680
unitig_48	30325	30325
unitig_45	14763	14763
unitig_57	18691	18685
unitig_15	2004	2004
unitig_47	15558	15558
unitig_6	4486498	4486633
unitig_60	26006	26006
unitig_61	16086	16086
unitig_51	9410	9410
unitig_59	14740	14740
unitig_4	19898	19898

`pacbio-util indel-targets`

Generate indel targets for correction. Uses samtools mpileup for indel detection. Targets are written to standard output.

Targets will only be generated where all read mappings containing an indel agree on the size and sequence of the indel. Further criteria for minimum indel fraction, minimum coverage, and maximum indel size may be adjusted via options.

pacbio-util indel-targets -f FASTA FILE1.bam [ FILE2.bam ... ]

OPTIONS

    -f FILE, --fasta FILE    PacBio assembly in Fasta format

    FILE1.bam [ FILE2.bam ]  BAM files of reads mapped to

    -                        Read mpileup from stdin rather than running
                             samtools. Note that mapping qualities
                             (option `-s`) are expected in the mpileup input.

    --indel-frac FLOAT       Do not report indels at a position if the
                             fraction of reads containing them is below FLOAT
                             [default 0.8]
    --indel-min-cov INT      Minimum read coverage to consider an indel
                             [default 10]
    --include-bad-indels     Include indels in target set that do not pass
                             our criteria, these lines are marked 'bad' in
                             the sixth column
    --max-indel-size INT     Maximum indel size to accept, beyond this the
                             position is considered in error [default 1000]
    --mapping-quality FLOAT  Minimum mean mapping quality of reads covering a
                             target site. Note that a value of 0 means no
                             minimum applied; reads with 0 mapQ are still
                             excluded by default (see --all-reads) [default 0]
    --all-reads              Include all reads.  Default is to exclude
                             multiply-mapped reads (mapQ < 1), alignments
                             that are not the primary (SAM flag 256), and
                             duplicate alignments (flag 1024).
    --all-bases              Include all bases.  Default is to exclude bases
                             below quality 13.
    --samtools FILE          Location of samtools executable
                             [default samtools]

    --verbose                Print informational messages
    --help, -?               help message

Below is an example subset of targets generated with this command. The first line of the targets output is the name of the assembly Fasta file against which the targets were generated. Following that, the columns are scaffold name, target position, reference base, total read coverage, type of target (always indel for this command), whether the target passed quality criteria (good or bad), the frequency of coverage supporting the indel, the size of the indel (positive for insertion, negative for deletion), a string describing the indel size and sequence, the number of reads supporting the indel, and the mean mapping quality of reads.

#assembly:pacbio_assembly.fasta
unitig_1	101522	C	72	indel	good	0.9444	1	+1A	68	60
unitig_1	119319	G	127	indel	good	0.8110	1	+1A	103	60
unitig_1	120891	A	104	indel	good	0.9231	1	+1T	96	60
unitig_1	122801	A	93	indel	good	0.9032	1	+1C	84	60
unitig_1	125303	G	84	indel	good	0.9286	1	+1A	78	60
unitig_1	131734	T	67	indel	good	0.8657	1	+1A	58	60
unitig_1	136434	A	28	indel	good	0.8214	1	+1C	23	60
unitig_1	170949	A	40	indel	good	0.9500	1	+1G	38	60
unitig_1	171182	A	80	indel	good	0.9500	1	+1G	76	60
unitig_1	172016	A	67	indel	good	0.9403	1	+1C	63	60
unitig_1	190068	A	18	indel	good	1.0000	1	+1G	18	60
unitig_1	190154	T	11	indel	good	0.9091	1	+1G	10	60
unitig_1	191027	C	104	indel	good	0.9231	1	+1G	96	60
unitig_1	191173	C	96	indel	good	0.9167	1	+1T	88	60
unitig_1	230981	T	63	indel	good	0.9683	1	+1C	61	60
unitig_1	270949	G	100	indel	good	0.9800	1	+1C	98	60
unitig_1	307836	A	38	indel	good	0.8947	1	+1C	34	60
unitig_1	310152	C	105	indel	good	0.8857	1	+1A	93	60
unitig_1	326536	G	105	indel	good	0.9143	1	+1A	96	60
unitig_1	360786	G	39	indel	good	0.8205	1	+1A	32	58.5
unitig_1	369367	C	98	indel	good	0.9592	1	+1A	94	60
unitig_1	371657	T	48	indel	good	0.9375	1	+1G	45	60
unitig_1	371776	G	43	indel	good	1.0000	1	+1T	43	60
unitig_1	373159	C	113	indel	good	0.9469	1	+1A	107	60

`pacbio-util indel-apply`

Apply indel targets to correct an assembly. The corrected assembly is written to standard output. The order of sequences in the assembly must be the same as that in the targets.

pacbio-util indel-apply [ -f FASTA ] [ -t TARGETS ]

OPTIONS

    -f | --fasta FILE        PacBio assembly in Fasta format, to be corrected.
                             Must be the same assembly used for
                             './pacbio-util indel-targets'.
                             If the assembly is not specified, it is determined
                             from the first line of the target list.
    -t | --targets TARGETS   List of indel targets to apply, standard output
                             from './pacbio-util indel-targets -f FILE ...'.  If no
                             targets are specified, they are read from stdin.

    --include-bad-indels     Apply indels in target set that are marked 'bad'
    --max-indel-size INT     Maximum indel size to apply [default 1000]
    --min-indel-frac FLOAT   Minimim indel fraction to apply, a value of 0
                             means apply all indels in target set [default 0]
    --skip-deletion-verify   Do not verify that a deletion matches the
                             sequence at that position in the assembly
                             [default 0]
    --skip-softmasked        Do not apply targets to softmasked regions of the
                             assembly, as determined by use of lowercase
                             [default 0]

    --verbose                Print informational messages
    --help, -?               help message

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
pacbio-util		pacbio-util

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PacBio-utilities

Correction of small indels in polished assemblies

`pacbio-util indel-targets`

`pacbio-util indel-apply`

About

Releases

Packages

Languages

License

tiramisutes/PacBio-utilities

Folders and files

Latest commit

History

Repository files navigation

PacBio-utilities

Correction of small indels in polished assemblies

pacbio-util indel-targets

pacbio-util indel-apply

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`pacbio-util indel-targets`

`pacbio-util indel-apply`

Packages