Add support for 10x bam files #12
Hello,

For computing signatures of single-cell RNA-seq data, it's very convenient to use the 10x bam file directly. This was implemented in sourmash using the Python package bamnostic. For C/C++, I presume the htslib library would be used directly. It's been a while since I stretched my C muscles, but I could give it a shot.

Warmest,
Olga
htslib could be used, or we could parse BAM files directly to avoid the portability issues that come with linking against htslib. The simplest thing would be a preprocessing script or executable that takes a BAM and writes, for each barcode, a concatenation of its reads as an individual fastx record before being provided to the tool, though this would be rather slow and just a stopgap measure. After looking over the sourmash modifications, I think something like the following would likely work (untested):
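Roughly, assuming the corrected cell barcode lives in the `CB` aux tag and using an `N` separator between reads (an illustrative choice, since N-containing k-mers are skipped downstream anyway):

```cpp
// Untested sketch of the stopgap preprocessor: group read sequences by
// 10x cell barcode and emit one concatenated FASTA record per barcode.
// Build: g++ -O2 bam2fastx.cpp -lhts -o bam2fastx
#include <htslib/sam.h>
#include <cstdio>
#include <string>
#include <unordered_map>

int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "Usage: %s in.bam > out.fa\n", argv[0]); return 1; }
    samFile *fp = sam_open(argv[1], "r");
    if (!fp) { std::fprintf(stderr, "Could not open %s\n", argv[1]); return 1; }
    bam_hdr_t *hdr = sam_hdr_read(fp);
    bam1_t *rec = bam_init1();
    std::unordered_map<std::string, std::string> by_barcode;
    while (sam_read1(fp, hdr, rec) >= 0) {
        uint8_t *aux = bam_aux_get(rec, "CB"); // corrected cell barcode tag
        if (!aux) continue;                    // no barcode: skip the read
        std::string &cat = by_barcode[bam_aux2Z(aux)];
        if (!cat.empty()) cat += 'N';          // separator between reads
        const uint8_t *seq = bam_get_seq(rec);
        for (int i = 0; i < rec->core.l_qseq; ++i)
            cat += seq_nt16_str[bam_seqi(seq, i)]; // 4-bit code -> ASCII base
    }
    for (const auto &kv: by_barcode)
        std::printf(">%s\n%s\n", kv.first.c_str(), kv.second.c_str());
    bam_destroy1(rec); bam_hdr_destroy(hdr); sam_close(fp);
}
```

Note this holds everything in memory, which is exactly the scaling problem discussed below.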
We do think this would be a great use case, and we'll consider working on native support, as it would be much faster and avoid a lot of temporary files. You'll probably want to use a rather small sketch size (p = 8 or less; use the flag […]).
That was my initial solution as well, but creating a dictionary for each cell uses 300GiB+ of memory for bam files with >1000 cells, which is unreasonable. There's also another PR which sorts the alignment file by barcode first, which has a much smaller memory footprint.
Yes, the RNA-seq profiles probably have ~100k-500k unique k-mers, so smaller sketches should be fine.
A barcode-based project of mine for ctDNA from years ago used barcode binning by prefix into separate files, followed by hashmap-based collapsing. Sorting by barcode first was my initial method for this project as well, but I found that I could go from a week or more to demultiplex large experiments down to half an hour with that full rewrite. Sorting gets very expensive for billions of reads.

I'm not sure I understand -- will you be comparing pairwise distances between linked reads or between sequencing experiments? If you want to compare whole RNA-seq experiments, you should just use a bam2fq conversion tool (or avoid aligning to begin with) and then use […]. If you want to do all comparisons between sets of reads containing individual barcodes, the simple answer is a hashmap from barcode to a HyperLogLog sketch. The barcodes would take 4 bytes each, since 16-base-pair barcodes can be encoded in 32 bits. A draft of this section would look something like the sketch below.
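(Untested; `HLL` here is a stand-in for a real HyperLogLog implementation, and the encoding helper is illustrative.)

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal stand-in HyperLogLog; swap in a real implementation
// (cardinality estimation and union/intersection omitted here).
struct HLL {
    unsigned p;                    // log2 of the register count
    std::vector<uint8_t> regs;
    explicit HLL(unsigned p_=10): p(p_), regs(1u << p_, 0) {}
    void add(uint64_t h) {
        const uint32_t idx = h >> (64 - p);  // top p bits select a register
        const uint64_t rest = h << p;        // remaining bits set the rank
        const uint8_t rank = rest ? uint8_t(__builtin_clzll(rest) + 1)
                                  : uint8_t(64 - p + 1);
        if (rank > regs[idx]) regs[idx] = rank;
    }
};

// Pack a 16bp barcode into 32 bits at 2 bits/base; false if it contains N.
bool encode_barcode(const char *bc, uint32_t &out) {
    out = 0;
    for (int i = 0; i < 16; ++i) {
        switch (bc[i]) {
            case 'A': out <<= 2;            break;
            case 'C': out = (out << 2) | 1; break;
            case 'G': out = (out << 2) | 2; break;
            case 'T': out = (out << 2) | 3; break;
            default:  return false;         // N/ambiguous base
        }
    }
    return true;
}

// One sketch per cell: ~4 bytes per key plus 2^p bytes per HLL.
std::unordered_map<uint32_t, HLL> cell_sketches;
// Per read: if (encode_barcode(cb, key)) then, for each k-mer hash h:
//     cell_sketches[key].add(h);  // inserts a default HLL on first sight
```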
This barcode encoding is similar to bonsai's, but there's no reverse complementing, because the barcode is the barcode. It's not parallelized; if the bam were already indexed, a map/reduce strategy seeking to separate portions of the file would be useful, but something like this wouldn't hurt for a first pass of the parse, and most of the time would probably be spent in the quadratic comparison phase anyway.

Yes, bonsai ignores all k-mers containing Ns. Most tools do (e.g., Kraken{1,2} and their competitors). This allows us to use 2 bits per character and therefore fit 31 nucleotides in a 64-bit int, using UINT64_C(-1) as an error code. (You can still use 32, but then you won't correctly identify any sequences consisting of 32 Ts. Hopefully that doesn't actually matter.) A minimal version of that encoding is sketched below.

I'd like to have more time to work on this, but we're in the final stages of preparation before submission, so things requiring time will need to wait a bit.
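(Illustrative only -- structured differently from bonsai's actual code, but with the same error-code semantics:)

```cpp
#include <cstdint>

// Encode a k-mer (k <= 31) at 2 bits per base; UINT64_C(-1) is the error
// code for k-mers containing N (or any non-ACGT base). With k = 31 the
// error code can't collide with a real k-mer; with k = 32, a run of 32 Ts
// would also encode to all-ones and be mistaken for the error code.
uint64_t encode_kmer(const char *s, unsigned k) {
    uint64_t kmer = 0;
    for (unsigned i = 0; i < k; ++i) {
        uint64_t c;
        switch (s[i]) {
            case 'A': c = 0; break;
            case 'C': c = 1; break;
            case 'G': c = 2; break;
            case 'T': c = 3; break;
            default:  return UINT64_C(-1); // skip this k-mer
        }
        kmer = (kmer << 2) | c;
    }
    return kmer;
}
```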
Huh wow, I didn't realize sorting could be so bad!
Sorry, I was unclear. The goal is to compare barcodes both across and within RNA-seq experiments, e.g. cellX from experimentA vs cellY from experimentB, and cellX_1, cellX_2, ..., cellX_n from experimentA. So it's the first option: all pairwise distances between the sets of reads associated with each barcode. Using the 10x bam file instead of starting from the raw fastq enables an apples-to-apples comparison of Jaccard similarities vs pairwise distances of the processed count matrix.

Forgot to say before: the conversion of the bam to a bunch of concatenated reads separated by an […]

That C++ code is a little over my head with the bit shifting, and I don't think I could write the […]
No worries. I think I have enough to go off of here. Good luck with the submission!
It's not just sorting; it's also more efficient encoding of the sequence and a more easily parallelized algorithm. We do plan to implement an efficient method for working on this soon. I think it's ideally suited to the problem, as the memory requirement ends up being just on the order of […]
A draft implementation (untested) is now available at https://github.com/dnbaker/10xdash. There's likely some debugging to do; if you run into problems and can provide some sample files, I'll be better able to fix them.
Thank you! I'll check this out next week.
Great! No rush on this, but also: are these bams aligned/sorted? If they are, there are additional opportunities for parallelism, though currently it just goes through the file linearly and parallelizes the JI calculations (schematically, something like the loop below). Enjoy your weekend!
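A schematic of that quadratic comparison phase (the `Sketch` type and `jaccard` estimator are placeholders, not 10xdash's actual API):

```cpp
#include <cstddef>
#include <vector>

// All (i, j) pairs are independent, so the quadratic comparison phase
// parallelizes trivially; compile with -fopenmp.
template<typename Sketch, typename JI>
std::vector<double> pairwise_ji(const std::vector<Sketch> &sketches, JI jaccard) {
    const std::size_t n = sketches.size();
    std::vector<double> out(n * n, 1.0);  // diagonal stays JI(x, x) = 1
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            out[i * n + j] = out[j * n + i] = jaccard(sketches[i], sketches[j]);
    return out;
}
```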
We're continuing to develop this work at https://github.com/dnbaker/10xdash. Feel free to ask further questions here or there, but for now I'm closing this issue.