Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bogart failure in Canu 2.1 - error with AS_BAT_MarkRepeatReads #1806

Closed
dportik opened this issue Sep 30, 2020 · 8 comments
Closed

Bogart failure in Canu 2.1 - error with AS_BAT_MarkRepeatReads #1806

dportik opened this issue Sep 30, 2020 · 8 comments

Comments

@dportik
Copy link

dportik commented Sep 30, 2020

Hello,
I am performing a metagenome assembly with PacBio HiFi data and have run into a repeatable error. I am using canu 2.1 with:
canu -d STD -p Zymo6331-STD -pacbio-hifi Zymo6331-STD.fastq genomeSize=100m maxInputCoverage=1000 batMemory=200

I am running on SGE with the same configuration I have used to run other assemblies successfully (gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" and gridOptions="-V -S /bin/bash -q bigmem").

I am working with three samples, each run using the same set of arguments as above (but with different -d, -p, and fastq names). Two finished without any issues. For one sample, the run eventually fails. I scanned the error message and it suggested trying to run it again. I did, and it produced the same error again. It suggests the issue occurs when running Bogart:

Found perl:
   /bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /mnt/software/j/jre/1.8.0_144/bin/java
   java version "1.8.0_144"

Found canu:
   /mnt/software/c/canu/2.1/bin/canu
   canu 2.1

-- canu 2.1
--
-- CITATIONS
--
-- For assemblies of PacBio HiFi reads:
--   Nurk S, Walenz BP, Rhiea A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S.
--   HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
--   biorXiv. 2020.
--   https://doi.org/10.1101/2020.03.14.992248
-- 
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/mnt/software/j/jre/1.8.0_144/bin/java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 72 CPUs and 504 gigabytes of memory.
-- Detected Sun Grid Engine in '/usr/share/gridengine/default'.
-- User supplied Parallel Environment 'smp'.
-- User supplied Memory Resource      'mem_free'.
-- 
-- Found  17 hosts with  80 cores and  754 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  440 GB memory under Sun Grid Engine control.
-- Found   4 hosts with  24 cores and   47 GB memory under Sun Grid Engine control.
-- Found  18 hosts with  48 cores and  251 GB memory under Sun Grid Engine control.
-- Found  14 hosts with  16 cores and   46 GB memory under Sun Grid Engine control.
-- Found  14 hosts with  80 cores and  755 GB memory under Sun Grid Engine control.
-- Found   1 host  with  24 cores and   39 GB memory under Sun Grid Engine control.
-- Found  14 hosts with  56 cores and  503 GB memory under Sun Grid Engine control.
-- Found   2 hosts with  24 cores and   31 GB memory under Sun Grid Engine control.
-- Found  14 hosts with  72 cores and  503 GB memory under Sun Grid Engine control.
-- Found  13 hosts with  24 cores and   94 GB memory under Sun Grid Engine control.
-- Found  26 hosts with  48 cores and  503 GB memory under Sun Grid Engine control.
-- Found   4 hosts with 256 cores and  503 GB memory under Sun Grid Engine control.
-- Found   2 hosts with  16 cores and   94 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  188 GB memory under Sun Grid Engine control.
-- Found   1 host  with  24 cores and  141 GB memory under Sun Grid Engine control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     13.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       10.000 GB    8 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   10.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       16.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      200.000 GB    8 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- In 'Zymo6331-ULI.seqStore', found PacBio HiFi reads:
--   PacBio HiFi:              1
--
--   Corrected:                1
--   Corrected and Trimmed:    1
--
-- Generating assembly 'Zymo6331-ULI' in '/dept/appslab/projects/old/2020/dp_Zymo/HiCanu_Assembly/ULI':
--    - assemble HiFi reads.
--
-- Parameters:
--
--  genomeSize        100000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.0000 (  0.00%)
--    obtOvlErrorRate 0.0250 (  2.50%)
--    utgOvlErrorRate 0.0100 (  1.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.0000 (  0.00%)
--    obtErrorRate    0.0250 (  2.50%)
--    utgErrorRate    0.0100 (  1.00%)
--    cnsErrorRate    0.0500 (  5.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Bogart failed, tried 2 times, giving up.
--

ABORT:
ABORT: canu 2.1
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT: Disk space available:  91264.777 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/4-unitigger/unitigger.err):
ABORT:
ABORT:   mergeOrphans()-- ignored       0               tigs with 0 reads; failed to place
ABORT:   mergeOrphans()--
ABORT:   
ABORT:   ==> MARK SIMPLE BUBBLES.
ABORT:       using 0.010000 user-specified threshold
ABORT:   
ABORT:   
ABORT:   findPotentialOrphans()-- working on 4954 tigs.
ABORT:   findPotentialOrphans()-- found 3866 potential orphans.
ABORT:   mergeOrphans()-- flagged    3691        bubble tigs with 550991 reads
ABORT:   mergeOrphans()-- placed        0 unique orphan tigs with 0 reads
ABORT:   mergeOrphans()-- shattered     0 repeat orphan tigs with 0 reads
ABORT:   mergeOrphans()-- ignored       1               tigs with 38 reads; failed to place
ABORT:   mergeOrphans()--
ABORT:   classifyAsUnassembled()--      0 tigs           0 bases -- singleton
ABORT:   classifyAsUnassembled()--      0 tigs           0 bases -- too few reads        (< 2 reads)
ABORT:   classifyAsUnassembled()--      0 tigs           0 bases -- too short            (< 0 bp)
ABORT:   classifyAsUnassembled()--      0 tigs           0 bases -- single spanning read (> 1.000000 tig length)
ABORT:   classifyAsUnassembled()--     58 tigs      687914 bases -- low coverage         (> 0.500000 tig length at < 3 coverage)
ABORT:   classifyAsUnassembled()--   4758 tigs   111950257 bases -- acceptable contigs
ABORT:   
ABORT:   
ABORT:   ==> GENERATING ASSEMBLY GRAPH.
ABORT:   
ABORT:   computeErrorProfiles()-- Computing error profiles for 4954 tigs, with 8 threads.
ABORT:   computeErrorProfiles()-- Finished.
ABORT:   
ABORT:   AssemblyGraph()-- allocating vectors for placements, 117.061MB
ABORT:   AssemblyGraph()-- finding edges for 885032 reads (821084 contained), ignoring 1672196 unplaced reads, with 8 threads.
ABORT:   AssemblyGraph()-- building reverse edges.
ABORT:   AssemblyGraph()-- build complete.
ABORT:   
ABORT:   ==> BREAK REPEATS.
ABORT:   
ABORT:   computeErrorProfiles()-- Computing error profiles for 4954 tigs, with 8 threads.
ABORT:   computeErrorProfiles()-- Finished.
ABORT:   bogart: bogart/AS_BAT_MarkRepeatReads.C:887: std::vector<breakReadEnd> buildBreakPoints(TigVector&, Unitig*, intervalList<int>&, intervalList<int>&, std::vector<confusedEdge>&): Assertion `isRepeat == true' failed.
ABORT:   
ABORT:   Failed with 'Aborted'; backtrace (libbacktrace):
ABORT:   utility/src/utility/system-stackTrace.C::83 in _Z17AS_UTL_catchCrashiP7siginfoPv()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   bogart/AS_BAT_MarkRepeatReads.C::887 in _Z16buildBreakPointsR9TigVectorP6UnitigR12intervalListIiES5_RSt6vectorI12confusedEdgeSaIS7_EE()
ABORT:   bogart/AS_BAT_MarkRepeatReads.C::1051 in _Z15markRepeatReadsP13AssemblyGraphR9TigVectordjdRSt6vectorI12confusedEdgeSaIS4_EE()
ABORT:   bogart/bogart.C::656 in main()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:

Any suggestions as to what might be going wrong here? I have attached the unitigger.err file here too (unitigger.err.txt), in case that helps.

Thanks!
Dan

@brianwalenz
Copy link
Member

Can you share the overlaps (unitigging/.ovlStore) and read metadata (.seqStore, excluding the 'blobs' files)? The sequence data is stored in the 'blobs' files, and I don't need that to debug.

Upload directions are in the FAQ.

@dportik
Copy link
Author

dportik commented Oct 1, 2020

Sure thing. I just sent over the contents of Zymo6331-ULI.seqStore/ (minus the blobs) as Zymo6331-ULI.seqStore.tar.gz (74MB). The contents of unitigging/Zymo6331-ULI.ovlStore/ were quite large. I started the transfer for Zymo6331-ULI.ovlStore.tar.gz (8.8GB) but that may take some time to finish. If I should only include a subset of the contents for the unitigging/Zymo6331-ULI.ovlStore/ please let me know.

@brianwalenz
Copy link
Member

Great! Thanks! Shouldn't be too hard to fix now, but I won't get to it for another 16 hours.

Unfortunately, the whole ovlStore is needed.

@dportik
Copy link
Author

dportik commented Oct 2, 2020

I think I have found the problem on my end. These HiFi reads were generated from a new prep type and require trimming before analysis. The adapter occurs on 99% of the reads used for this failed assembly. This would explain why the error only occurred with this particular sample, and not the other two.

This is likely a bizarre edge-case for Canu, and I wouldn't expect the assembly from these untrimmed reads to be usable. You may want to close this issue unless you think the error is still worth exploring.

@brianwalenz
Copy link
Member

Wonderful! You can use "-untrimmed -pacbio-hifi" to enable trimming of hifi reads.

It's still a good bug. I've found what looks like the problem, but don't know the correct way to fix it yet.

@brianwalenz
Copy link
Member

Looking at the high-quality overlaps confirms your explanation. This command will show the overlap picture for the first 100 reads.

ovStoreDump -S ./Zymo6331-ULI.seqStore -O ./Zymo6331-ULI.ovlStore -p 1-100 -erate 0.001 | more

Here are the overlaps for read 30:

A       0:5700    A        30       0:5700       5700                   |----------------------------->|
A       0:5700    B   1441096       0:5697       8174   0.088%          |----------------------------->| +2477
A    1518:5700    B   1959246    6678:10859     10859   0.048%          |       <----------------------| +6678
A    2020:5700    B    161374    3981:7658       7658   0.000%          |          <-------------------| +3981
A    2772:5700    B     76363    5903:8833       8833   0.000%          |              <---------------| +5903
A    2780:5700    B    379232    3188:6103       6103   0.000%          |              <---------------| +3188
A    2847:5700    B    853801    2065:4918       4918   0.000%          |              <---------------| +2065
A    2878:5700    B    303282    5257:8076       8076   0.000%          |               <--------------| +5257
A    2995:5700    B   2181497    7116:9818       9818   0.000%          |               <--------------| +7116
A    3145:5700    B   2372310       0:2551       7246   0.000%          |                ------------->| +4695
A    3163:5700    B   1782366    6556:9094       9094   0.000%          |                <-------------| +6556
A    3181:5700    B   2047605    3030:5546       5546   0.000%          |                <-------------| +3030
A    3231:5700    B    260113    4555:7022       7022   0.000%          |                 <------------| +4555
A    3264:5700    B   2473761    4022:6461       6461   0.000%          |                 <------------| +4022
A    3299:5700    B   1789716    5263:7664       7664   0.000%          |                 <------------| +5263
A    3302:5700    B   1877426    4834:7230       7230   0.000%          |                 <------------| +4834
A    3313:5700    B   1931868    3069:5457       5457   0.000%          |                 <------------| +3069
A    3334:5700    B   1263467       0:2362       6952   0.000%          |                 ------------>| +4590
A    3481:5700    B   1973160    4133:6348       6348   0.000%          |                  <-----------| +4133
A    3490:5700    B    306787    2082:4290       4290   0.000%          |                  <-----------| +2082
A    3593:5700    B   2085001    1762:3865       3865   0.000%          |                  <-----------| +1762
A    3607:5700    B    911354       0:2091       6356   0.000%          |                  ----------->| +4265
A    3760:5700    B   1994932    2829:4766       4766   0.000%          |                   <----------| +2829
A    3835:5700    B   1530501       0:1861       7598   0.000%          |                    --------->| +5737
A    3835:5700    B   1529629       0:1861       5177   0.000%          |                    --------->| +3316
A    3875:5700    B    365874       0:1821       2331   0.000%          |                    --------->| +510
A    3961:5700    B    597658    5317:7050       7050   0.000%          |                    <---------| +5317
A    4180:5700    B   2529311    2530:4044       4044   0.000%          |                      <-------| +2530

No overlaps off the 5' end of read 30, and most overlaps involve the 3' end of the other read.

The first line describes the read we're showing overlaps for, the other lines are the overlapping reads. X:Y is the position on the two reads, the number before the percent identity is the length of the read. +X is the amount not aligned.

brianwalenz added a commit that referenced this issue Oct 5, 2020
@brianwalenz
Copy link
Member

Many thanks for the data. Fixed!

@dportik
Copy link
Author

dportik commented Oct 5, 2020

That makes total sense. Thanks for digging into this. I re-ran the assembly after the trimming and de-duplication steps (this library actually involves PCR amplification), and the assembly looks really good. Thanks for all your work on Canu!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants