Releases: Gregor-Mendel-Institute/bookend
v1.2.1
Bugfixes for new softclipping behavior and splice junction filtering.
- In v1.2.0 the filter --max_intron was incorrectly implemented in bookend condense, producing a crash
- Default behavior for bookend elr --data_type ONT now assumes start and end tags on all reads
- bookend elr now does not require softclipping to recognize a start tag passed in the read name or with the argument -s
- Swapped bookend elr --remove_noncanonical for --allow_noncanonical, making discarding noncanonical splice junctions default behavior
Bookend v1.2.0: merge update
Feature addition for Bookend to implement bookend merge. This new utility lets you integrate one or more assemblies into a reference annotation, following gene and transcript naming conventions. Reference transcripts with a matching assembly will have their 5' and 3' ends updated, and they will be given evidence attributes that describe how many times they were assembled and in which samples.
Merge behavior:
- Process transcripts in descending order of total genomic length
- Merge assemblies first, applying no filters
  - Combine the attributes of merged transcripts according to --attr_merge: sum or mean of expression values (TPM, cov, S.reads, E.reads)
  - All assemblies classified as 'full_match' to another assembly will be combined into a single transcript model
- Integrate merged assemblies with reference (decreasing length):
  - Determine the class of each merged transcript vs. reference
  - Name the transcript and add it to the reference list
    - 'full_match' retain the original transcript_id
    - 'exon_match' transcripts with 5' and/or 3' variation are named after the matching transcript_id with an extra suffix '_<count>'
    - Novel isoforms are given the gene_id with a suffix '.i<count>'
    - Novel antisense transcripts receive the '-AS' suffix
    - Intronic transcripts: '-IT' suffix
    - Intergenic transcripts are named 'BOOKEND_<count>'
- Apply filters:
  - Isoforms must have been found at least --rep_filter times
  - The sum/max TPM must be at least --tpm_filter
  - Multiply these filters by --high_conf for suspected artifacts (fragments and fusions)
  - The spliced transcript length must be at least --min_len nucleotides
  - The percentage of capped 5' signal must be at least --cap_percent
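The filter step of the merge behavior can be sketched roughly as follows. This is a minimal illustration, not Bookend's implementation: the Candidate class, its field names, the passes_filters function, and the default threshold values are all hypothetical. Two further assumptions are made: that --high_conf scales only the --rep_filter and --tpm_filter thresholds, and that the capped percentage is S.capped divided by S.reads. Only the option names come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one merged transcript (field names are illustrative)."""
    reps: int            # number of assemblies in which the isoform was found
    tpm: float           # summed or max TPM across assemblies
    length: int          # spliced transcript length in nucleotides
    s_reads: float       # total 5' end signal
    s_capped: float      # capped portion of the 5' end signal
    suspected_artifact: bool = False  # e.g. classified as a fragment or fusion


def passes_filters(t, rep_filter=1, tpm_filter=1.0, high_conf=2.0,
                   min_len=50, cap_percent=0.0):
    """Return True if the candidate survives every filter.

    The default values are placeholders, not Bookend's defaults.
    """
    # Assumption: suspected artifacts face rep/TPM thresholds scaled by high_conf.
    scale = high_conf if t.suspected_artifact else 1.0
    if t.reps < rep_filter * scale:
        return False
    if t.tpm < tpm_filter * scale:
        return False
    if t.length < min_len:
        return False
    # Assumption: capped percentage is S.capped / S.reads * 100.
    capped = 100.0 * t.s_capped / t.s_reads if t.s_reads > 0 else 0.0
    return capped >= cap_percent
```

Under these assumptions, an isoform found twice passes a --rep_filter of 2, but the same isoform flagged as a suspected fragment would need four observations when --high_conf is 2.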
Bugfixes
- Changes to bookend elr --sj_shift in v1.1 allowed malformed exons with zero or negative length.
- Added --max_intron to utilities elr, assemble, and condense
- assemble, condense and elr utilities now check for and discard malformed entries with negative exon lengths
- bookend elr: terminal exons with noncanonical gaps are discarded
- bookend elr: it is now possible to use all three sources of splice junction evidence together (--splice, --reference, --genome)
- Summary log of bookend label no longer counts --discard_untrimmed reads in Total Output
- bookend elr: refactored softclipping decision tree to better identify untrimmed 5' and 3' oligos
- bookend label: now retrieves UMIs from adapters in either forward or reverse orientation
- bookend label: extended the maximum phred score from 40 to 60
- bookend label: the UMI sequence can be composed of IUPAC ambiguity characters other than N
- bookend label: oligomer extensions (e.g. TTTT+) cannot exceed --max_end
- bookend label: mismatches are no longer tolerated in the last 5nt of an oligomer
- bookend label: best trim is now determined by closest sequence match, not by maximum trim length
- bookend classify: now treats single-exon transcripts less than half the length of their matching transcript as a 'fragment'
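To illustrate the malformed-entry and --max_intron checks mentioned in this list, here is a minimal sketch. It is not taken from Bookend's source: the function name is_well_formed and the representation of a transcript as a sorted list of 0-based, half-open (start, end) exon intervals are assumptions made for the example.

```python
def is_well_formed(exons, max_intron=None):
    """Check a transcript given as sorted (start, end) half-open exon intervals.

    Returns False if any exon has zero or negative length, or if any
    gap between adjacent exons exceeds max_intron (when it is set).
    """
    for start, end in exons:
        if end - start <= 0:
            return False
    if max_intron is not None:
        # Compare each exon's end with the next exon's start.
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            if right_start - left_end > max_intron:
                return False
    return True
```

A caller would discard any entry for which this returns False before passing the remaining entries on to assembly.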
v1.1.1
Bugfix to v1.1.0 that irons out problems encountered in bookend elr when splice junction evidence is provided from multiple sources.
Splice junction evidence hierarchy:
- A reference splice junction database. This includes all splice junctions provided by an annotation file (BED12/GTF/GFF3) by the argument --reference, and/or all introns from a BED6 or STAR SJ.out.tab file provided by --splice. These are treated as maximally reliable, and an unannotated junction that is --sj_shift or fewer nucleotides away will be shifted to match the SJDB junction.
- Genomic motif, if a --genome is provided. The intron boundaries are checked for a strand-specific canonical or semi-canonical motif. Accepted motifs: GT-AG, GC-AG, AT-AC, GA-AG for forward-stranded alignments, and CT-AC, CT-GC, GT-AT, CT-TC for reverse-stranded alignments.
- 'XS' and 'ts' tags passed from aligners. These are specific to the alignment settings of STAR/Hisat/minimap, so they can be of variable confidence.
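The reverse-strand motifs in the second tier are the reverse complements of the forward-strand ones, read left to right on the genome. The sketch below derives them and checks an intron against both sets. It is an illustration only: the names FORWARD_MOTIFS, REVERSE_MOTIFS, revcomp and junction_strand are hypothetical, and 0-based, half-open intron coordinates are assumed.

```python
# Forward-strand (donor, acceptor) dinucleotide pairs listed in the hierarchy.
FORWARD_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC"), ("GA", "AG")}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# On the reverse strand, the left intron boundary holds the reverse
# complement of the acceptor and the right boundary that of the donor.
REVERSE_MOTIFS = {(revcomp(acc), revcomp(don)) for don, acc in FORWARD_MOTIFS}


def junction_strand(genome_seq, intron_start, intron_end):
    """Return '+', '-', or None for an intron at [intron_start, intron_end)."""
    left = genome_seq[intron_start:intron_start + 2].upper()
    right = genome_seq[intron_end - 2:intron_end].upper()
    if (left, right) in FORWARD_MOTIFS:
        return "+"
    if (left, right) in REVERSE_MOTIFS:
        return "-"
    return None
```

Deriving the reverse set from the forward set keeps the two lists from drifting apart, and the derived pairs match the reverse-stranded motifs named above.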
Bookend v1.1.0: ONT update
Bookend update to implement a number of improvements, including easier handling of Oxford Nanopore (ONT) reads.
CHANGES:
- bookend elr: a number of --data_type presets were added for "pacbio", "ont", "direct_rna", and "smartseq" libraries, and it will attempt to adjust settings appropriately to recognize the strand orientation and end labels for each data type.
- bookend elr: if cDNA reads are used without trimming end labels, the argument --untrimmed allows correct inference of RNA strand from the sequence composition of the softclipped ends of each alignment. Use for ONT Direct cDNA and PCR cDNA data if it was not pre-processed with bookend label.
- bookend elr: added the choice to provide a set of reference splice junctions with either --reference (GTF/GFF3/BED12) or --splice (BED6/SJ.out.tab). This prevents noncanonical reference junctions from being discarded.
- bookend assemble: can now recognize a sorted aligned long-read BAM file and attempts a "simple assembly". NOT recommended as a default, but useful for a quick look at whether Bookend will work with the provided alignments. For recommendations on how to properly process ONT/PacBio reads, see the updated Bookend User Guide.
- bookend assemble: added a --truncation_filter so that the aggressiveness of filtering putative RNA degradation artifacts can be user specified.
- bookend fasta: added the option --orf to report the longest open reading frame translation as an amino acid FASTA file for an assembly/annotation.
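As a rough picture of what a longest-ORF translation involves, the sketch below scans the three forward frames of a transcript sequence. It is not Bookend's code: the function name longest_orf is hypothetical, and it assumes the standard genetic code, an ATG start codon, and that only ORFs ending in a stop codon are counted. How bookend fasta --orf treats incomplete ORFs is not stated in these notes.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def longest_orf(transcript):
    """Return the peptide of the longest ATG-to-stop ORF on the given strand.

    Returns an empty string if no complete ORF exists.
    """
    seq = transcript.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        current = None  # peptide under construction, or None outside an ORF
        for pos in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X for ambiguous codons
            if current is None:
                if aa == "M":
                    current = "M"
            elif aa == "*":
                if len(current) > len(best):
                    best = current
                current = None
            else:
                current += aa
    return best
```

Only the transcript's own strand is scanned, on the assumption that an assembled transcript is already oriented 5' to 3'.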
BUGFIXES:
- More efficient pruning of low-abundance splice junctions during Membership Matrix construction
Major Release v1.0.1
Bookend Release Version 1.0.1
A number of behavioral changes, bugfixes, and new utilities are introduced in this new major release. See the new v1.0.1 Bookend User Guide for current utility usage and arguments.
Behavioral changes:
- All utilities can now stream data directly from gzipped input file(s).
- For all utilities, all argument defaults are now displayed in the --help text.
- bookend elr no longer writes unsorted ELR files; default ELR output will always be position-sorted.
- bookend elr identifies additional end tags through softclipped alignments that were too short to be called by bookend label.
- Column 7 of an ELR file may now contain a triple of scores so that the number of start/end tags is not lost during condensed assembly (formatted as three floats between pipe symbols, cov|start|end).
- bookend label reverses all reads of an input FASTQ file if using the argument --strand reverse.
- bookend assemble no longer uses source information by default; use argument --use_sources to enable.
- bookend assemble GTF output attributes "S.reads", "S.capped", and "E.reads" now contain the proportional weight of the tag clusters assigned to that isoform, rather than the full weight of the tag cluster.
- bookend assemble pre-filters branchpoints from the Membership Matrix if an adjacent gap in read coverage would prevent them from being in a complete path.
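The score triple now allowed in column 7 of an ELR file can be read with a few lines of code. This sketch is an assumption-laden illustration: the function name parse_elr_weight is hypothetical, and because the exact serialization (for example, whether leading or trailing pipes are written) is not spelled out here, the parser tolerates both a bare float and any pipe-delimited triple.

```python
def parse_elr_weight(field):
    """Parse ELR column 7 as (cov, start, end).

    Accepts a single float (start and end are then None) or three
    floats separated by pipe symbols, e.g. '12.5|3.0|1.0'.
    """
    # Drop empty pieces so optional leading/trailing pipes are tolerated.
    parts = [p for p in field.strip().split("|") if p]
    if len(parts) == 1:
        return float(parts[0]), None, None
    if len(parts) == 3:
        cov, start, end = (float(p) for p in parts)
        return cov, start, end
    raise ValueError(f"unexpected score field: {field!r}")
```

Returning None for the missing tag counts keeps single-score lines distinguishable from triples whose start and end counts are zero.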
New utilities:
- bookend bedgraph writes Bedgraph-format files of read coverage or tag abundance by genomic position.
- bookend fasta takes an annotation file (GTF/GFF3/BED12) and a genome FASTA, and writes a transcript-level FASTA file.
- bookend gtf-ends writes a BED file of the unique set of 5' or 3' clusters represented in an annotation (GTF/GFF3/BED12).
Bugfixes:
- Prevented dropping the last temporary block of reads during bookend elr-sort.
- In bookend assemble, corrected the Overlap Matrix decision tree to allow for mutual containment.
- bookend assemble no longer terminates new path construction at a locus if a new path gets trimmed to match an existing path.
- Prevented duplicate ELR headers in output of bookend elr-combine.
- bookend condense no longer skips chunks of length 1.
- FASTA utilities now use the correct complement for the IUPAC ambiguity code "R" (previously the wrong complement was used).
- Prevented malformed transcript ends from being introduced during bookend assemble by get_transcript_attributes.
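For reference on the IUPAC complement fix in this list: R (A or G) complements to Y (C or T), and vice versa. The table below is a generic sketch of IUPAC-aware reverse complementation, not Bookend's own code; the names IUPAC_COMPLEMENT and reverse_complement are illustrative.

```python
# Complement of each IUPAC nucleotide code.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R",  # R = A/G, Y = C/T
    "S": "S", "W": "W",  # S = C/G, W = A/T (self-complementary)
    "K": "M", "M": "K",  # K = G/T, M = A/C
    "B": "V", "V": "B",  # B = C/G/T, V = A/C/G
    "D": "H", "H": "D",  # D = A/G/T, H = A/C/T
    "N": "N",
}


def reverse_complement(seq):
    """Reverse complement a DNA string that may contain IUPAC ambiguity codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq.upper()))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any complement table.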