sambamba-markdup: Read reference ID is out of range #224

Closed
travc opened this issue Jun 17, 2016 · 19 comments
@travc

travc commented Jun 17, 2016

sambamba markdup is dying (sig 11 I think) with:
sambamba-markdup: Read reference ID is out of range

This only happens for some BAM files (just one at the moment). Others work fine.

v0.6.1

command:
sambamba_v0.6.1 markdup -t 8 merged.bam merged_markdup.bam

complete output:

finding positions of the duplicate reads in the file...
  sorted 24525923 end pairs
     and 8975055 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 4852 ms
  found 3504671 duplicates
collected list of positions in 1 min 36 sec
marking duplicates...
sambamba-markdup: Read reference ID is out of range

The merge step was also done with sambamba. Mapping done with bwa mem.

It fails fairly quickly, but the input file it is failing on is fairly huge (5.1G). I can make it available to someone for testing (it isn't human data), but don't want to just post it.

The problem may have something to do with the reference (total size 1.3G). The 2nd scaffold (out of 3919) in the reference is 552137040 bp long, and some picard and GATK tools have been choking on it. 552137040 * 8 is > 2^32, unlike any of the human chroms... so maybe something there.
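The arithmetic behind that guess can be checked directly. This is only a quick sketch: the factor of 8 is the one suggested above and is an assumption, not something confirmed from the sambamba source.

```python
# Length of the 2nd scaffold, as quoted in this comment.
scaffold_len = 552137040

# The factor of 8 is the guess made above; it is an assumption,
# not a value taken from sambamba's code.
scaled = scaffold_len * 8

print(scaled)            # 4417096320
print(2 ** 32)           # 4294967296
print(scaled > 2 ** 32)  # True
```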

@lomereiter
Contributor

Hi,

It sounds like a bug in the merge tool. It would be helpful if you provided the following for each of the inputs and for the output of the merging step:

  • SAM header
  • dump of the binary section with reference sequence information (sambamba view --reference-info)
  • the reference name column (sambamba view | cut -f 3); it might segfault on the final file, and that's fine.

@travc
Author

travc commented Jun 17, 2016

I'll email you a link if that is ok...

@lomereiter
Contributor

Thanks, I downloaded the files and can reproduce the issue.

@lomereiter
Contributor

This appears to have the same cause as #214 (a buggy version of the lz4 library); it finished just fine with v0.6.2.

@travc
Author

travc commented Jun 20, 2016

Thanks. I'm busily doing workarounds for other broken tools, and was just assuming this was the same problem. I'll be sure to update before rerunning markdup.

@lomereiter
Contributor

I reran it several times, and occasionally it still fails :( Reopening.

lomereiter reopened this Jun 20, 2016

@travc
Author

travc commented Jun 20, 2016

Apparently there is a deep limitation in BAI indexing such that the max contig size cannot exceed (2^29)-1. One of the contigs in my ref violates that. Maybe this is what is tripping up sambamba.

The workaround I mentioned above is to split that contig (it is a chromosome, so I can split it into arms with no problem). However, that doesn't really solve the problem for others or for the future.

If this really is the issue, the fact that everything is eventually moving to CRAM, or at least to CSI indexes, may be a good enough long-term fix. For now though, prominently noting the limitation and catching the condition before it causes hard-to-track errors would probably be a good idea. Again, that assumes the size limit is the problem.
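A pre-flight check along those lines could look roughly like the sketch below. It assumes the SAM header is available as plain text (for example, the output of `samtools view -H`) and flags every `@SQ` line whose `LN` exceeds the (2^29)-1 limit quoted above. The function name `oversized_contigs` and the example header are hypothetical; only the 552137040 length comes from this thread. Whether this limit is what actually trips up sambamba is not established here.

```python
BAI_MAX_CONTIG_LEN = 2**29 - 1  # 536870911, the BAI limit quoted above


def oversized_contigs(header_text, limit=BAI_MAX_CONTIG_LEN):
    """Return (name, length) for every @SQ line whose LN exceeds limit."""
    found = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # @SQ fields are tab-separated TAG:VALUE pairs, e.g. SN:chr1 LN:1000
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        length = int(fields.get("LN", 0))
        if length > limit:
            found.append((fields.get("SN"), length))
    return found


# Hypothetical header for illustration; contig names are made up.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:scaffold_1\tLN:1000000\n"
    "@SQ\tSN:scaffold_2\tLN:552137040\n"
)
print(oversized_contigs(example_header))  # [('scaffold_2', 552137040)]
```

Running a check like this before indexing would turn the failure into an explicit message naming the offending contig.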

@lomereiter
Contributor

That's unlikely to be the cause; markdup doesn't do any index queries. In the worst-case scenario it will write a broken .BAI index.

Might be related to #189

@travc
Author

travc commented Jun 20, 2016

Ugh... Deadlocks. That is above my pay grade I'm afraid.
I might have seen it happen though... I'm running through a snakemake workflow, and just assumed the problem was at that level.
This dataset I'm working on does seem to be extra problematic for pretty much all the tools, not just sambamba.

Just tell me if there's anything particular you'd like me to look out for. I feel a bit guilty not digging into the code myself to help, but I'm not going to learn D at the moment.

@pwwang

pwwang commented Jun 28, 2016

I also encountered this with v0.6.3
Full output:

finding positions of the duplicate reads in the file...
  sorted 422836935 end pairs
     and 4965899 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 100879 ms
  found 99439311 duplicates
collected list of positions in 62 min 3 sec
marking duplicates...
sambamba-markdup: Read reference ID is out of range

Some BAM files get through markdup fine, but indexing the marked BAM file then fails with:
sambamba-index: Error reading BGZF block starting from offset 134206492: stream error: not enough data in stream

But they all work fine with Picard MarkDuplicates.

@RoanKanninga
Copy link

RoanKanninga commented Aug 8, 2016

What's the status of this issue? We (Genetics department at UMCG) are heavily dependent on sambamba; upgrading to the latest version (0.6.3) helped for some samples/runs, but not all.
We're experiencing this in WES and WGS data.

We are also using bwa mem and merging with sambamba.

@lomereiter
Contributor

Hi @RoanKanninga, could you check if compiling from source fixes the issue? I also experience it with the release binaries, but can't reproduce it in the development environment. It may be that outdated LLVM on CentOS leads to bugs like this.

@yifangt

yifangt commented Jan 23, 2017

I still have this problem with v0.6.5 binary.
Any update on the issue?
$ sambamba-merge: Read reference ID is out of range
Thanks!
Yifang

@sambrightman
Collaborator

If #189 is suspected then it'd be good to have @RoanKanninga's kernel version(s).

@pjotrp
Member

pjotrp commented Jan 25, 2017

#219 includes a new 0.6.5 binary of sambamba with debug info. May be worth trying that since it was built with a recent ldc and llvm 3.7.

pjotrp closed this as completed Feb 24, 2017

@Maarten-vd-Sande

I just got this with version 0.7.0 in sambamba view. What is causing this? How can I solve this?

@cviner

cviner commented Jul 31, 2020

I also just got this in version 0.7.1, with sambamba view, when converting from SAM to BAM, across 20 threads.

@charlottewright

I also have this issue when using version 0.8.1 while sorting a bam.

@travc
Author

travc commented Nov 17, 2021

@charlottewright The problem I was having was due to BAI indexing limitations and a non-human genome with a really big chromosome. If you have a chrom/contig > (2^29)-1 in length, that's your problem. Using a CSI index will likely fix it. If you can't do that, split the offending chrom/contig into two.
If it isn't that problem, I've got no clue.
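For the splitting route, the coordinates can be worked out with something like the sketch below. It simply cuts a contig into the fewest near-equal pieces that each fit under (2^29)-1; the function name `split_intervals` is hypothetical, and cutting at the arithmetic midpoint is an assumption made for illustration. A biologically meaningful breakpoint (such as the chromosome arms) may be a better choice.

```python
BAI_MAX_CONTIG_LEN = 2**29 - 1  # 536870911


def split_intervals(contig_len, limit=BAI_MAX_CONTIG_LEN):
    """Split [0, contig_len) into the fewest near-equal half-open
    intervals such that no interval is longer than limit."""
    if contig_len <= 0:
        raise ValueError("contig_len must be positive")
    pieces = -(-contig_len // limit)  # ceiling division
    base, extra = divmod(contig_len, pieces)
    intervals, start = [], 0
    for i in range(pieces):
        # The first `extra` pieces take one additional base each.
        end = start + base + (1 if i < extra else 0)
        intervals.append((start, end))
        start = end
    return intervals


# The 552137040 bp scaffold from this thread needs two pieces.
print(split_intervals(552137040))
# [(0, 276068520), (276068520, 552137040)]
```

The intervals are 0-based and half-open, so they would need converting before being used as 1-based region strings.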
