sambamba-markdup: Read reference ID is out of range #224

travc · 2016-06-17T05:14:32Z

sambamba markdup is dying (sig 11 I think) with:
sambamba-markdup: Read reference ID is out of range

This only happens for some BAM files (just one at the moment). Others work fine.

v0.6.1

command:
sambamba_v0.6.1 markdup -t 8 merged.bam merged_markdup.bam

complete output:

finding positions of the duplicate reads in the file...
  sorted 24525923 end pairs
     and 8975055 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 4852 ms
  found 3504671 duplicates
collected list of positions in 1 min 36 sec
marking duplicates...
sambamba-markdup: Read reference ID is out of range

The merge step was also done with sambamba. Mapping done with bwa mem.

It fails fairly quickly, but the input file it is failing on is fairly huge (5.1G). I can make it available to someone for testing (it isn't human data), but don't want to just post it.

The problem may have something to do with the reference (total size 1.3G). The 2nd scaffold (out of 3919) in the reference is 552137040 bp long, and some picard and GATK tools have been choking on it. 552137040 * 8 is > 2^32, unlike any of the human chroms... so maybe something there.
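
For anyone who wants to check the arithmetic, here is a quick sketch. The factor of 8 is only the guess made above, not something confirmed from the sambamba, Picard, or GATK sources, and the human chr1 length (GRCh38) is included just for comparison.

```python
# Length of the 2nd scaffold, as quoted above.
scaffold_len = 552137040
# Longest human chromosome (chr1, GRCh38), for comparison only.
human_chr1_len = 248956422

# Hypothetical scaling factor taken from the guess above (length * 8).
scaled_scaffold = scaffold_len * 8
scaled_chr1 = human_chr1_len * 8

print("scaffold * 8 exceeds 2^32:", scaled_scaffold > 2**32)
print("chr1 * 8 exceeds 2^32:", scaled_chr1 > 2**32)
print("unscaled scaffold exceeds signed 32-bit max:", scaffold_len > 2**31 - 1)
```

So the unscaled length still fits in a signed 32-bit integer; only the scaled value would overflow, and only if some tool really does multiply by 8.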


lomereiter · 2016-06-17T05:41:54Z

Hi,

It sounds like a bug in the merge tool. It would be helpful if you provided the following for each of the inputs and for the output of the merging step:

SAM header
dump of the binary section with reference sequence information (sambamba view --reference-info)
the reference name column (sambamba view | cut -f 3); it might segfault on the final file, that's fine.
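
Once the header and the reference name column have been dumped as text, the two can be cross-checked with something like the sketch below. It is only an illustration: the function name and the sample records are made up, and it works on SAM text, so it can only inspect whatever the view command managed to print before any crash.

```python
from collections import Counter


def unknown_reference_names(sam_lines):
    """Count RNAME values (SAM column 3) that no @SQ header line declares."""
    declared = set()
    unknown = Counter()
    for raw in sam_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header line: remember the sequence names from @SQ records.
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        declared.add(field[3:])
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        rname = fields[2]
        # "*" marks a read with no reference and is always allowed.
        if rname != "*" and rname not in declared:
            unknown[rname] += 1
    return unknown


# Made-up example input; real input would be the text dumped from the BAM.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:scaffold_1\tLN:1000",
    "read1\t0\tscaffold_1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tscaffold_9\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(unknown_reference_names(example))
```

An empty result on the inputs combined with a crash on the merged file would point at the merge step rather than at the mapping.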

travc · 2016-06-17T06:12:39Z

I'll email you a link if that is ok...

lomereiter · 2016-06-17T07:13:32Z

Thanks, I downloaded the files and can reproduce the issue.

lomereiter · 2016-06-19T13:06:26Z

Appears to have the same cause as #214 (buggy version of the lz4 library); it finished just fine with v0.6.2.

travc · 2016-06-20T05:01:35Z

Thanks. I'm busily doing workarounds for other broken tools, and was just assuming this was the same problem. I'll be sure to update before rerunning markdup.

lomereiter · 2016-06-20T05:10:15Z

I reran it several times, and occasionally it still fails :( Reopening.

travc · 2016-06-20T05:18:00Z

Apparently there is a deep limitation in BAI indexing such that the max contig size cannot exceed (2^29)-1. One of the contigs in my ref violates that. Maybe this is what is tripping up sambamba.

The workaround I mentioned above is to split that contig (it is a chromosome, so I can split it into arms with no problem). However, that doesn't really solve the problem for others and the future.

If this really is the issue, then everything eventually moving to CRAM, or at least to CSI indexes, may be a good enough long-term fix. For now though, prominently noting the limitation and catching the condition before it causes hard-to-track errors would probably be a good idea. Again, assuming that size limit is the problem.
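
Catching the condition up front could look roughly like the sketch below, which scans SAM header lines for @SQ records longer than the (2^29)-1 limit mentioned above. The function name and the sample header are made up for illustration, and this is not something sambamba itself does as far as this thread shows.

```python
# Maximum contig length the BAI scheme can address, as quoted above.
BAI_MAX_CONTIG_LEN = 2**29 - 1


def oversized_contigs(header_lines, limit=BAI_MAX_CONTIG_LEN):
    """Return (name, length) for every @SQ record longer than the limit."""
    too_long = []
    for raw in header_lines:
        line = raw.rstrip("\n")
        if not line.startswith("@SQ"):
            continue
        # Turn "SN:name" / "LN:length" fields into a small dict.
        tags = dict(
            field.split(":", 1)
            for field in line.split("\t")[1:]
            if ":" in field
        )
        length = int(tags["LN"])
        if length > limit:
            too_long.append((tags["SN"], length))
    return too_long


# Made-up header; the second length is the scaffold size reported above.
header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:scaffold_1\tLN:1000",
    "@SQ\tSN:scaffold_2\tLN:552137040",
]
for name, length in oversized_contigs(header):
    print(f"{name}: {length} bp exceeds the BAI limit of {BAI_MAX_CONTIG_LEN}")
```

Running a check like this on the header before indexing would turn a hard-to-track failure into an explicit message.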

lomereiter · 2016-06-20T05:30:42Z

That's unlikely to be the cause, since markdup doesn't do any index queries. In the worst-case scenario it will write a broken .BAI index.

Might be related to #189

travc · 2016-06-20T05:40:06Z

Ugh... Deadlocks. That is above my pay grade I'm afraid.
I might have seen it happen though... I'm running through a snakemake workflow, and just assumed the problem was at that level.
This dataset I'm working on does seem to be extra problematic for pretty much all the tools, not just sambamba.

Just tell me if there's anything particular you'd like me to look out for. I feel a bit guilty not digging into the code myself to help, but I'm not going to learn D at the moment.

pwwang · 2016-06-28T19:07:58Z

I also encountered this with v0.6.3
Full output:

finding positions of the duplicate reads in the file...
  sorted 422836935 end pairs
     and 4965899 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 100879 ms
  found 99439311 duplicates
collected list of positions in 62 min 3 sec
marking duplicates...
sambamba-markdup: Read reference ID is out of range

Some BAM files get through markdup fine, but indexing the marked BAM file then fails with:
sambamba-index: Error reading BGZF block starting from offset 134206492: stream error: not enough data in stream

But they all work fine with Picard MarkDuplicates.
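
One cheap check for a truncated output file is to look for the 28-byte empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. The sketch below is only an assumption about what might be wrong here: a missing marker suggests truncation, but a present marker does not prove the blocks in the middle of the file are intact. The function name is made up.

```python
import io

# Empty BGZF block used as the end-of-file marker (from the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def ends_with_bgzf_eof(stream):
    """Return True if a seekable binary stream ends with the BGZF EOF block."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF


# In-memory demonstration; for a real file, pass open(path, "rb") instead.
complete = io.BytesIO(b"placeholder-for-compressed-blocks" + BGZF_EOF)
truncated = io.BytesIO(b"placeholder-for-compressed-blocks")
print(ends_with_bgzf_eof(complete), ends_with_bgzf_eof(truncated))
```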

RoanKanninga · 2016-08-08T06:00:07Z

What's the status on this issue? We (Genetics department at UMCG) are heavily dependent on sambamba, upgrading to the latest version (0.6.3) did help for some samples/runs, but not all.
We're experiencing this in WES and WGS data.

We are also using bwa mem and merging with sambamba.

lomereiter · 2016-08-14T14:38:25Z

Hi @RoanKanninga, could you check if compiling from source fixes the issue? I also experience it with the release binaries, but can't reproduce it in the development environment. It may be that outdated LLVM on CentOS leads to bugs like this.

yifangt · 2017-01-23T23:07:49Z

I still have this problem with v0.6.5 binary.
Any update on the issue?
$ sambamba-merge: Read reference ID is out of range
Thanks!
Yifang

sambrightman · 2017-01-24T09:15:36Z

If #189 is suspected then it'd be good to have @RoanKanninga's kernel version(s).

pjotrp · 2017-01-25T09:27:39Z

#219 includes a new 0.6.5 binary of sambamba with debug info. May be worth trying that since it was built with a recent ldc and llvm 3.7.

Maarten-vd-Sande · 2019-11-19T14:56:15Z

I just got this with version 0.7.0 in sambamba view. What is causing this? How can I solve this?

cviner · 2020-07-31T15:15:16Z

I also just got this in version 0.7.1, with sambamba view, when converting from SAM to BAM, across 20 threads.

charlottewright · 2021-11-16T17:49:12Z

I also have this issue when using version 0.8.1 while sorting a bam.

travc · 2021-11-17T04:53:32Z

@charlottewright The problem I was having was due to BAI indexing limitations and a non-human genome with a really big chromosome. If you have a chrom/contig > (2^29)-1 in length, that's your problem. Using a CSI index will likely fix it. If you can't do that, split the offending chrom/contig into two.
If it isn't that problem, I've got no clue.
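
For the splitting workaround, a minimal sketch of the idea is below. The "_L"/"_R" naming, the function names, and the toy sequence are all made up for illustration; a real reference would be read from and written back to FASTA, and the breakpoint would be chosen by hand (e.g. at the centromere).

```python
def split_contig(name, sequence, breakpoint):
    """Split one contig into two records at a 0-based breakpoint."""
    if not 0 < breakpoint < len(sequence):
        raise ValueError("breakpoint must fall strictly inside the contig")
    return [
        (f"{name}_L", sequence[:breakpoint]),
        (f"{name}_R", sequence[breakpoint:]),
    ]


def to_original_position(part_name, pos, name, breakpoint):
    """Map a 1-based position on a split part back to the original contig."""
    if part_name == f"{name}_L":
        return pos
    if part_name == f"{name}_R":
        return pos + breakpoint
    raise ValueError(f"{part_name} is not a part of {name}")


# Toy example with a hypothetical contig name and an 8 bp sequence.
parts = split_contig("chr2", "ACGTACGT", 3)
print(parts)
print(to_original_position("chr2_R", 1, "chr2", 3))
```

Keeping the breakpoint around makes it possible to translate coordinates from the split reference back to the original one after variant calling.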

lomereiter closed this as completed Jun 19, 2016

lomereiter reopened this Jun 20, 2016

lomereiter added the markdup label Jul 16, 2016

pjotrp closed this as completed Feb 24, 2017
