Releases: Gregor-Mendel-Institute/bookend

v1.2.1

25 Sep 15:06

Bugfixes for new softclipping behavior and splice junction filtering.

  • Fixed a crash in bookend condense caused by an incorrect implementation of the --max_intron filter in v1.2.0
  • Default behavior for bookend elr --data_type ONT now assumes start and end tags on all reads
  • bookend elr no longer requires softclipping to recognize a start tag passed in the read name or with the argument -s
  • Replaced bookend elr --remove_noncanonical with --allow_noncanonical; noncanonical splice junctions are now discarded by default

Bookend v1.2.0: merge update

19 Sep 16:14

This release adds the bookend merge utility, which integrates one or more assemblies into a reference annotation, following gene and transcript naming conventions. Reference transcripts with a matching assembly have their 5' and 3' ends updated and receive evidence attributes that describe how many times they were assembled and in which samples.

Merge behavior:

  1. Process transcripts in descending order of total genomic length
  2. Merge assemblies first, applying no filters
    • Combine the attributes of merged transcripts according to --attr_merge: sum or mean of expression values (TPM, cov, S.reads, E.reads)
    • All assemblies classified as 'full_match' to another assembly will be combined into a single transcript model
  3. Integrate merged assemblies with reference (decreasing length):
    • Determine the class of each merged transcript vs. reference
    • Name the transcript and add it to the reference list
      • 'full_match' transcripts retain the original transcript_id
      • 'exon_match' transcripts with 5' and/or 3' variation are named after the matching transcript_id with an extra suffix '_<count>'
      • Novel isoforms are given the gene_id with a suffix '.i<count>'
      • Novel antisense transcripts receive the '-AS' suffix
      • Intronic transcripts receive the '-IT' suffix
      • Intergenic transcripts are named 'BOOKEND_<count>'
  4. Apply filters:
    • Isoforms must have been found at least --rep_filter times
    • The sum/max TPM must be at least --tpm_filter
    • Multiply these filters by --high_conf for suspected artifacts (fragments and fusions)
    • The spliced transcript length must be at least --min_len nucleotides
    • The percentage of capped 5' signal must be at least --cap_percent
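
The naming and filtering steps above can be sketched in Python. This is a rough illustration, not Bookend's implementation: the class labels other than 'full_match' and 'exon_match', the MergedTranscript fields, the gene_id base used for the '-AS' and '-IT' suffixes, and the choice to scale both the replicate and TPM thresholds by --high_conf are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedTranscript:
    # Hypothetical record; field names are illustrative only.
    match_class: str
    reps: int                 # number of times the isoform was assembled
    tpm: float                # summed or maximum TPM
    spliced_length: int       # total exon length in nucleotides
    cap_percent: float        # percentage of capped 5' signal
    suspected_artifact: bool = False  # fragment or fusion


def name_transcript(match_class: str, count: int = 1,
                    ref_transcript_id: Optional[str] = None,
                    gene_id: Optional[str] = None) -> str:
    """Apply the naming conventions listed in step 3."""
    if match_class == "full_match":
        return ref_transcript_id
    if match_class == "exon_match":
        return f"{ref_transcript_id}_{count}"
    if match_class == "isoform":        # assumed label for novel isoforms
        return f"{gene_id}.i{count}"
    if match_class == "antisense":      # assumed label and suffix base
        return f"{gene_id}-AS"
    if match_class == "intronic":       # assumed label and suffix base
        return f"{gene_id}-IT"
    return f"BOOKEND_{count}"           # intergenic


def passes_filters(t: MergedTranscript, *, rep_filter: float,
                   tpm_filter: float, high_conf: float,
                   min_len: int, cap_percent: float) -> bool:
    """Apply the step 4 filters; thresholds are passed in explicitly."""
    # Suspected artifacts must clear stricter replicate and TPM thresholds.
    scale = high_conf if t.suspected_artifact else 1.0
    return (t.reps >= rep_filter * scale
            and t.tpm >= tpm_filter * scale
            and t.spliced_length >= min_len
            and t.cap_percent >= cap_percent)
```

No default thresholds are shown because the release notes do not state them; consult the --help text for the actual defaults.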

Bugfixes

  • Fixed a regression from v1.1 in which changes to bookend elr --sj_shift allowed malformed exons with zero or negative length.
  • Added --max_intron to utilities elr, assemble, and condense
  • assemble, condense and elr utilities now check for and discard malformed entries with negative exon lengths
  • bookend elr: terminal exons with noncanonical gaps are discarded
  • bookend elr: it is now possible to use all three sources of splice junction evidence together (--splice, --reference, --genome)
  • Summary log of bookend label no longer counts --discard_untrimmed reads in Total Output
  • bookend elr: refactored softclipping decision tree to better identify untrimmed 5' and 3' oligos
  • bookend label: now retrieves UMIs from adapters in either forward or reverse orientation
  • bookend label: extended the maximum phred score from 40 to 60
  • bookend label: the UMI sequence can now contain IUPAC ambiguity characters other than N
  • bookend label: oligomer extensions (e.g. TTTT+) cannot exceed --max_end
  • bookend label: mismatches are no longer tolerated in the last 5nt of an oligomer
  • bookend label: best trim is now determined by closest sequence match, not by maximum trim length
  • bookend classify: now treats single-exon transcripts less than half the length of their matching transcript as a 'fragment'
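
The malformed-entry and --max_intron checks described above can be pictured with a small validity test. This is a sketch under stated assumptions, not Bookend's code: exons are taken to be sorted (start, end) pairs in 0-based half-open coordinates, and the function name is hypothetical.

```python
from typing import List, Tuple


def is_well_formed(exons: List[Tuple[int, int]], max_intron: int) -> bool:
    """Return False for entries that should be discarded."""
    # Reject any exon with zero or negative length.
    for start, end in exons:
        if end - start <= 0:
            return False
    # Reject any gap between consecutive exons longer than max_intron.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if right_start - left_end > max_intron:
            return False
    return True
```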

v1.1.1

01 Jun 11:43

Bugfix to v1.1.0 that irons out problems encountered in bookend elr when splice junction evidence is provided from multiple sources.

Splice junction evidence hierarchy:

  1. A reference splice junction database. This includes all splice junctions provided by an annotation file (BED12/GTF/GFF3) by the argument --reference, and/or all introns from a BED6 or STAR SJ.out.tab file provided by --splice. These are treated as maximally reliable, and an unannotated junction that is --sj_shift or fewer nucleotides away will be shifted to match the SJDB junction.
  2. Genomic motif, if a --genome is provided. The intron boundaries are checked for a strand-specific canonical or semi-canonical motif. Accepted motifs: GT-AG, GC-AG, AT-AC, GA-AG for forward-stranded alignments, and CT-AC, CT-GC, GT-AT, CT-TC for reverse-stranded alignments.
  3. 'XS' and 'ts' tags passed from aligners. These are specific to the alignment settings of STAR/Hisat/minimap, so they can be of variable confidence.
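
The first two tiers of the hierarchy can be illustrated with a short Python sketch. It is not taken from the Bookend source: the function names are hypothetical, intron coordinates are assumed to be 0-based and half-open on the forward genome strand, and shifting is shown for a single boundary position. The reverse-strand motifs are derived as reverse complements of the forward ones, which reproduces the list above.

```python
from typing import Iterable, Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# (donor, acceptor) dinucleotides accepted on the forward strand.
FORWARD_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC"), ("GA", "AG")}


def revcomp(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# On the reverse strand the acceptor appears first in genome order.
REVERSE_MOTIFS = {(revcomp(acc), revcomp(don)) for don, acc in FORWARD_MOTIFS}


def motif_strand(genome_seq: str, intron_start: int,
                 intron_end: int) -> Optional[str]:
    """Return '+', '-', or None based on the intron's terminal dinucleotides."""
    left = genome_seq[intron_start:intron_start + 2].upper()
    right = genome_seq[intron_end - 2:intron_end].upper()
    if (left, right) in FORWARD_MOTIFS:
        return "+"
    if (left, right) in REVERSE_MOTIFS:
        return "-"
    return None


def snap_to_reference(position: int, reference_positions: Iterable[int],
                      sj_shift: int) -> int:
    """Shift a boundary to the nearest reference boundary within sj_shift."""
    nearby = [r for r in reference_positions if abs(r - position) <= sj_shift]
    if not nearby:
        return position  # no reference junction close enough; leave as-is
    return min(nearby, key=lambda r: abs(r - position))
```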

Bookend v1.1.0: ONT update

20 Apr 11:11

Bookend update to implement a number of improvements, including easier handling of Oxford Nanopore (ONT) reads.

CHANGES:

  • bookend elr: a number of --data_type presets were added for "pacbio", "ont", "direct_rna", and "smartseq" libraries, and it will attempt to adjust settings appropriately to recognize the strand orientation and end labels for each data type.
  • bookend elr: if cDNA reads are used without trimming end labels, the argument --untrimmed allows correct inference of RNA strand from the sequence composition of the softclipped ends of each alignment. Use for ONT Direct cDNA and PCR cDNA data if it was not pre-processed with bookend label.
  • bookend elr: added the choice to provide a set of reference splice junctions with either --reference (GTF/GFF3/BED12) or --splice (BED6/SJ.out.tab). This prevents noncanonical reference junctions from being discarded.
  • bookend assemble: can now recognize a sorted aligned long-read BAM file and attempts a "simple assembly". NOT recommended as a default, but useful for a quick look at whether Bookend will work with the provided alignments. For recommendations on how to properly process ONT/PacBio reads, see the updated Bookend User Guide
  • bookend assemble: added a --truncation_filter so that the aggressiveness of filtering putative RNA degradation artifacts can be user specified.
  • bookend fasta: added the option --orf to report the longest open reading frame translation as an amino acid FASTA file for an assembly/annotation.
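
As a rough picture of what a longest-ORF translation involves, the sketch below scans the three forward frames of a transcript sequence using the standard genetic code. It is an illustration only: the function name is hypothetical, and the choices to require an ATG start and an in-frame stop codon are assumptions that may differ from how bookend fasta --orf behaves.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def longest_orf(transcript_seq: str) -> str:
    """Return the longest ATG-to-stop translation on the given strand."""
    seq = transcript_seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        protein = ""
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
            if not in_orf:
                if aa == "M":          # open a new ORF at the first ATG
                    in_orf = True
                    protein = "M"
            elif aa == "*":            # close the ORF at a stop codon
                if len(protein) > len(best):
                    best = protein
                in_orf = False
            else:
                protein += aa
    return best
```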

BUGFIXES:

  • More efficient pruning of low-abundance splice junctions during Membership Matrix construction

Major Release v1.0.1

03 May 09:49

Bookend Release Version 1.0.1

A number of behavioral changes, bugfixes, and new utilities are introduced in this new major release. See the new v1.0.1 Bookend User Guide for current utility usage and arguments.

Behavioral changes:

  • All utilities can now stream data directly from gzipped input file(s).
  • For all utilities, all argument defaults are now displayed in the --help text.
  • bookend elr no longer writes unsorted ELR files; default ELR output will always be position-sorted.
  • bookend elr identifies additional end tags through softclipped alignments that were too short to be called by bookend label.
  • Column 7 of an ELR file may now contain a triple of scores so that the number of start/end tags is not lost during condensed assembly (formatted as three floats between pipe symbols, cov|start|end).
  • bookend label reverses all reads of an input FASTQ file if using the argument --strand reverse.
  • bookend assemble no longer uses source information by default; use argument --use_sources to enable.
  • bookend assemble GTF output attributes "S.reads", "S.capped", and "E.reads" now contain the proportional weight of the tag clusters assigned to that isoform, rather than the full weight of the tag cluster.
  • bookend assemble pre-filters branchpoints from the Membership Matrix if an adjacent gap in read coverage would prevent them from being in a complete path.
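
For readers handling ELR files in their own scripts, the column 7 change can be accommodated with a small parser. This is a minimal sketch based only on the cov|start|end description above; the function name is hypothetical, and stripping any leading or trailing pipe is a precaution rather than a documented requirement.

```python
from typing import Optional, Tuple


def parse_elr_score(field: str) -> Tuple[float, Optional[float], Optional[float]]:
    """Parse ELR column 7 as a single score or a cov|start|end triple."""
    parts = field.strip().strip("|").split("|")
    if len(parts) == 1:
        # Plain single score: no start/end tag counts recorded.
        return float(parts[0]), None, None
    if len(parts) == 3:
        cov, start, end = (float(p) for p in parts)
        return cov, start, end
    raise ValueError(f"unexpected score field: {field!r}")
```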

New utilities:

  • bookend bedgraph writes Bedgraph-format files of read coverage or tag abundance by genomic position.
  • bookend fasta takes an annotation file (GTF/GFF3/BED12) and a genome FASTA, and writes a transcript-level FASTA file.
  • bookend gtf-ends writes a BED file of the unique set of 5' or 3' clusters represented in an annotation (GTF/GFF3/BED12).

Bugfixes:

  • Prevented dropping the last temporary block of reads during bookend elr-sort.
  • In bookend assemble, corrected the Overlap Matrix decision tree to allow for mutual containment.
  • bookend assemble no longer terminates new path construction at a locus if a new path gets trimmed to match an existing path.
  • Prevented duplicate ELR headers in output of bookend elr-combine.
  • bookend condense no longer skips chunks of length 1
  • Fixed the FASTA utilities, which used the wrong complement for the IUPAC ambiguity code "R".
  • Prevented malformed transcript ends from being introduced during bookend assemble by get_transcript_attributes
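
For reference on the IUPAC fix above, the complement of "R" (A or G) is "Y" (T or C). The sketch below shows a reverse complement that handles the full set of ambiguity codes; it is a standalone illustration, not the code used by Bookend's FASTA utilities.

```python
# Each code maps to the code for the complementary set of bases.
IUPAC_CODES = "ACGTRYSWKMBDHVN"
IUPAC_COMPLEMENTS = "TGCAYRSWMKVHDBN"

# Build one translation table covering upper and lower case.
COMPLEMENT_TABLE = str.maketrans(
    IUPAC_CODES + IUPAC_CODES.lower(),
    IUPAC_COMPLEMENTS + IUPAC_COMPLEMENTS.lower(),
)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]
```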

Bookend

12 Jan 12:06
45c4c9f
Bookend Pre-release

Initial public pre-release of the Bookend transcript assembly package.
Corresponds to PyPI package 'bookend-rna' v0.1.3