-
Notifications
You must be signed in to change notification settings - Fork 59
Troubleshooting
Sometimes, Circlator may not work as expected with the default settings, and miss some contig merges, or miss circularizing a circular sequence. This page describes the most common reasons for this, and how to interpret the log files to find out what happened.
Circlator is most sensitive to the length of contig ends that are reassembled. This is
determined by circlator all --b2r_length_cutoff X
when running the whole
pipeline, or by circlator bam2reads --length_cutoff X
, when just extracting reads mapped to
contig ends. The value X
determines the cutoff between 'short' and 'long' contigs.
All reads mapped to contigs no longer than X
are used for reassembly. For contigs
longer than X
, the reads mapped within X/2
of the contig's ends are used for
reassembly. A larger value of X
means that more sequence is reassembled, which increases
the chances of unique sequence in the reassembly (particularly if the input contigs
end in repeats). However, making X
very large means that a large proportion of the
genome is being reassembled, which could result in a longer run time and more assembly
errors. Typically, X
should be set to at least twice the read length. Relatively small
values of X
can result in SPAdes failing to produce an assembly.
The other parameters that are most likely to affect the results are:
-
--merge_min_length_merge
: the minimum length of nucmer match; -
--merge_reassemble_end
: the maximum allowed distance between a nucmer match and the end of a reassembly SPAdes contig; -
--merge_ref_end
: the maximum allowed distance between a nucmer match and the end of an input assembly contig.
During iterative contig merging, Circlator writes verbose output to a log file
called 04.merge.merge.iterations.log
(if you ran circlator all
, otherwise the
filename has a different prefix if you ran circlator merge
). There are two things
you can do to check what happened, and why merges were made or missed:
- Check in the details of
04.merge.merge.iterations.log
. - Use ACT (Artemis Comparison Tool) to compare the current state of the assembly against the SPAdes reassembled contigs. At each iteration, Circlator writes a script that will open ACT showing the comparison of the assemblies relevant to that iteration. When ACT starts, if the "Indexed FASTA Found" message appears, choose "Concatenate Sequences", not "Use index".
The first two lines of each iteration detail the file of nucmer matches being used, and how to view it in ACT. For example, iteration 1 will look like this in the log file:
[merge iterative_merge 1] Using nucmer matches from 04.merge.merge.iter.1.coords
[merge iterative_merge 1] You can view the nucmer matches with ACT using: ./04.merge.merge.iter.1.start_act.sh
Using ACT is often the most informative method, since assemblies can vary dramatically, and so the log file cannot describe every possible situation. Some possible scenarios are shown below
First, Circlator checks each SPAdes reassembly contig in turn, to see if it can be used to join two of the input assembly contigs. It reports any potential pair of hits in the log file, for example:
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: checking nucmer matches
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: potential pair of hits to merge contigs...
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: 1 58376 58374 1 58376 58374 99.99 62258 78927 1 unitig_5 NODE_4_length_78927_cov_20.3595_ID_2334
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: 1436 24901 78927 55467 23466 23461 99.94 24901 78927 1 unitig_12 NODE_4_length_78927_cov_20.3595_ID_2334
This means that one of the matches is at the start of the SPAdes contig NODE_4, and the other is at the end. Also, the orientation and distances from the contig ends are OK. Next, Circlator checks if each of these matches have the correct orientation in the input contigs (in this case unitig_5 and unitig_12), and if there are longer hits to elsewhere in the genome. Based on these checks, the pair of matches can be accepted or rejected.
Lines in the log file showing a successful merge will look like this:
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: checking nucmer matches
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: potential pair of hits to merge contigs...
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: 1 58376 58374 1 58376 58374 99.99 62258 78927 1 unitig_5 NODE_4_length_78927_cov_20.3595_ID_2334
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: 1436 24901 78927 55467 23466 23461 99.94 24901 78927 1 unitig_12 NODE_4_length_78927_cov_20.3595_ID_2334
[merge iterative_merge 1] NODE_4_length_78927_cov_20.3595_ID_2334: might be used - no longer hits elsewhere and orientation/distance to ends OK
and later in the log:
[merge iterative_merge 1] Merging contigs unitig_5 and unitig_12 using these two hits:
[merge iterative_merge 1] 1 58376 58374 1 58376 58374 99.99 62258 78927 1 unitig_5 NODE_4_length_78927_cov_20.3595_ID_2334
[merge iterative_merge 1] 1436 24901 78927 55467 23466 23461 99.94 24901 78927 1 unitig_12 NODE_4_length_78927_cov_20.3595_ID_2334
The SPAdes contig NODE_4 was used to merge the input assembly contigs unitig_5 and unitig_7.
There must be no longer match from the SPAdes contig to somewhere other than the two input assembly contigs that are being considered for merging. If there is a longer match, the log file contains output like this:
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 1 50006 49851 1 50006 49851 99.67 837058 92383 1 unitig_7 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 33430 50001 75814 92383 16572 16570 99.82 195489 92383 1 unitig_2 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: rejected - there is a longer hit to elsewhere
The matches must also have the correct orientation and not be too far from the input assembly contig ends. If this is not the case, then the log file will look like this:
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 1 50006 49851 1 50006 49851 99.67 837058 92383 1 unitig_7 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 33430 50001 75814 92383 16572 16570 99.82 195489 92383 1 unitig_2 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: rejected - orientation/distance from ends not correct to make a merge
Note that there could be a longer hit to elswehere, and the orientation is bad, in which case both 'rejected' lines appear in the output:
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 1 50006 49851 1 50006 49851 99.67 837058 92383 1 unitig_7 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: 33430 50001 75814 92383 16572 16570 99.82 195489 92383 1 unitig_2 NODE_2_length_92383_cov_16.1266_ID_1432
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: rejected - there is a longer hit to elsewhere
[merge iterative_merge 1] NODE_2_length_92383_cov_16.1266_ID_1432: rejected - orientation/distance from ends not correct to make a merge
It is usually easiest to visualise the results in ACT, in addition to using the information in the log file. If what looks like a plausible merge was missed by Circlator, then the reason should be clear when using ACT. For example, the two ACT screenshots below show a failed contig merge. SPAdes did not produce a contig that spanned the ends of unitig_1 and unitig_7 (in light brown at the top), with them assembled into separate contigs.
However, decreasing the value of --b2r_length_cutoff
resulted in a more contiguous assembly
(see ACT screenshots below),
so that the SPAdes contig NODE_2 (in dark brown at the bottom) can be used to
merge the input assembly contigs unitig_1 and unitig_7 (in light brown at the top).
If the merge failed because the nucmer matches were too far from the contig ends, then
this would also be clear in ACT. In which case, increasing one (or both) of the options
--merge_reassemble_end
and --merge_ref_end
would fix the problem.
Circlator writes a verbose log file containing details of the
circularization process.
This file is called 04.merge.circularise_details.log
(if you ran circlator all
, otherwise the
filename has a different prefix if you ran circlator merge
).
Similarly to the merge stage, this file can be used in conjunction
with ACT to find out why circularizations were or were not made.
The script to run ACT is called
04.merge.circularise.start_act.sh
(it will have a different prefix if circlator merge
was used
instead of circlator all
).
First, Circlator checks each input assembly contig for a pair of nucmer matches to a SPAdes reassembly contig, that can be used for circularization. The first section of the log file shows the details of this process. For example:
[merge circularise_details] unitig_0 Checking 0 nucmer hits
[merge circularise_details] unitig_1 Checking 3 nucmer hits
[merge circularise_details] unitig_1 potential pair of nucmer hits for circularization:
[merge circularise_details] unitig_1 1 50000 42532 92375 50000 49844 99.67 5323329 92375 1 unitig_1 NODE_1_length_92375_cov_16.2004_ID_1471
[merge circularise_details] unitig_1 5286971 5323329 15341 51485 36359 36145 99.38 5323329 92375 1 unitig_1 NODE_1_length_92375_cov_16.2004_ID_1471
[merge circularise_details] unitig_1 cannot use this pair because positions/orientations of matches no good
[merge circularise_details] unitig_2 Checking 1 nucmer hits
[merge circularise_details] unitig_3 Checking 1 nucmer hits
[merge circularise_details] unitig_4 Checking 2 nucmer hits
[merge circularise_details] unitig_4 potential pair of nucmer hits for circularization:
[merge circularise_details] unitig_4 1 28924 59443 88367 28924 28925 99.97 96564 88367 1 unitig_4 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_4 28221 96564 1 68346 68344 68346 99.98 96564 88367 1 unitig_4 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_4 can use this pair of hits
In this example:
- unitig_0 had no nucmer matches, so cannot be circularized using pairs of matches
- unitig_1 had a pair of nucmer matches with positions that looked OK in the input contig unitig_1. However, the matches were rejected due to their positions and/or orientations in the reassembled contig (NODE_1).
- unitig_2 and unitig_3 had one match each, so cannot be circularized using pairs of matches
- unitig_4 has a pair of matches to NODE_2, which can be used to circularize unitig_4.
The above report only includes matches that are longer than the minimum length cutoff,
defined by --merge_min_length_merge
, which is 4000 by default.
Next, Circlator checks for any reassembly SPAdes contigs that SPAdes called as circular. This is reported in a line in the log file, like this:
[merge circularise_details] SPAdes reassembly contigs that are circular: NODE_2_length_88367_cov_17.6052_ID_1473,NODE_4_length_267_cov_1.07333_ID_1477
In this example, NODE_2 and NODE_4 are circular.
Finally, Circlator tries to circularize each of the input contigs in turn. The results are reported in the log file. For each input contig, Circlator tries to use a match to a SPAdes circular contig. If this is not possible, then it will use a suitable pair of nucmer matches (reported earlier), if such a pair was found.
Continuing the example from above, the output for unitig_4 looks as follows:
[merge circularise_details] unitig_4 Trying to circularize. Has nucmer hits to check...
[merge circularise_details] unitig_4 Checking hit: 1 28924 59443 88367 28924 28925 99.97 96564 88367 1 unitig_4 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_4 Not using hit to call as circular (hit too short)
then several similar lines, all rejecting the nucmer matches to the circular SPAdes contig NODE_2. No suitable match is found to any circular SPAdes contigs, so the pair of matches reported earlier is used to circularize unitig_4:
[merge circularise_details] unitig_4 No suitable matches to SPAdes circular contigs
[merge circularise_details] unitig_4 Could not circularize using matches to SPAdes circular contigs
[merge circularise_details] unitig_4 Circularizing using this pair of nucmer matches to SPAdes contig:
[merge circularise_details] unitig_4 1 28924 59443 88367 28924 28925 99.97 96564 88367 1 unitig_4 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_4 28221 96564 1 68346 68344 68346 99.98 96564 88367 1 unitig_4 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_4 Circularized: yes
However, unitig_0 could not be circularized:
[merge circularise_details] unitig_0 Trying to circularize. Has nucmer hits to check...
[merge circularise_details] unitig_0 Checking hit: 15103 16439 85426 86762 1337 1337 100.00 27823 88367 1 unitig_0 NODE_2_length_88367_cov_17.6052_ID_1473
[merge circularise_details] unitig_0 Not using hit to call as circular (hit too short)
[merge circularise_details] unitig_0 No suitable matches to SPAdes circular contigs
[merge circularise_details] unitig_0 Could not circularize using matches to SPAdes circular contigs
[merge circularise_details] unitig_0 Cannot circularize: no suitable nucmer hits
[merge circularise_details] unitig_0 Circularized: no
Note that there is a nucmer match to be checked, despite 0 hits reported earlier, because the earlier report only includes matches above a minimum length (default 4000bp). In this part of the pipeline, all matches are considered (so as not to exclude plasmids shorter than 4000bp)
Finally, here is an example where a SPAdes circular contig (NODE_9) is used to circularize an input contig (unitig_6). In this case, unitig_6 is multiple tandem copies of a plasmid, which SPAdes correctly assembled into one copy (NODE_9). This is the situation viewed in ACT:
The report in the log file begins like this:
[merge circularise_details] unitig_6 Trying to circularize. Has nucmer hits to check...
[merge circularise_details] unitig_6 Checking hit: 1 4528 4527 1 4528 4527 99.98 18939 4556 1 unitig_6 NODE_9_length_4556_cov_19.9306_ID_2485
[merge circularise_details] unitig_6 Hit is long enough. Percent of contig covered by hit is 99.36347673397718
unitig_6 has a match to NODE_9 that covers 99.36 percent of NODE_9 (4527 of 4556 bp). Next, Circlator checks if enough of unitig_6 is covered by one or more nucmer matches to NODE_9:
[merge circularise_details] unitig_6 reference bases covered by spades contig:(0,18938)
[merge circularise_details] unitig_6 ... which is 100.0 percent of 18939 bases
[merge circularise_details] unitig_6 Using hit to call as circular (enough bases covered)
[merge circularise_details] unitig_6 Circularized using matches to SPAdes circular contigs
[merge circularise_details] unitig_6 Circularized: yes
and we see that unitig_6 was circularized.