Still cannot read 10x bam file #16

Closed
olgabot opened this issue Sep 13, 2018 · 6 comments

@olgabot
Contributor

olgabot commented Sep 13, 2018

Hello, thank you so much for fixing the previous error with 10x genomics! Unfortunately I'm still running into a buffering bug while reading the alignments.

Describe the bug
Cannot use bamnostic to read single-cell chromium bam file produced by 10x genomics.

To Reproduce
Steps to reproduce the behavior:
Same code as here: #15 (comment)

Expected behavior

Expected to be able to iterate over each read without error.

Screenshots

Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/olgabot/code/sourmash/sourmash/__main__.py", line 77, in main
    cmd(sys.argv[2:])
  File "/Users/olgabot/code/sourmash/sourmash/commands.py", line 276, in compute
    pool.map(lambda x: maybe_add_alignment(x, cell_seqs, args, barcodes), bam_file)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/multiprocess/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/multiprocess/pool.py", line 343, in _map_async
    iterable = list(iterable)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 1410, in __next__
    read = bamnostic.AlignedSegment(self)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/core.py", line 170, in __init__
    block_size = unpack_int32(self._io.read(4))[0]
struct.error: unpack requires a buffer of 4 bytes

Desktop (please complete the following information):

  • OS: macOS 10.12.6
  • Python Version: Python 3.6.5 :: Anaconda, Inc.
  • bamnostic Version: 0.8.13, latest commit (125e7c6)

Additional context

NA

@betteridiot
Owner

betteridiot commented Sep 14, 2018

I see your problem. My knee-jerk reaction is that mapping asynchronously is causing the file cursor to come out of alignment. Depending on the function you are using to pull the reads, you may need to use the multiple_iterators parameter (if you are using fetch). Can you link me to the code that is calling _map_async?

@betteridiot
Owner

Okay, my first reaction was wrong. After looking at it, the fault is in how I coded serial access. It fails because it reaches the end of the file: I never coded EOF marker checking into serial access. I should be able to fix that tomorrow. Sorry for the delay.
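
For readers following along, here is a minimal sketch of the failure mode and one way to guard against it. It is not bamnostic's actual fix. It assumes an already-decompressed stream of records that each start with a 4-byte little-endian `block_size`, as in the traceback above, and the names `iter_records` and `make_stream` are hypothetical helpers made up for illustration.

```python
import io
import struct


def iter_records(stream):
    """Yield raw payloads from a stream of length-prefixed records.

    Hypothetical helper: stops cleanly when the stream is exhausted
    instead of passing an empty buffer to struct.unpack.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream: no more records
        if len(header) < 4:
            raise EOFError("truncated block_size field")
        (block_size,) = struct.unpack("<i", header)
        if block_size < 0:
            raise ValueError("negative block_size")
        payload = stream.read(block_size)
        if len(payload) < block_size:
            raise EOFError("truncated record")
        yield payload


def make_stream(payloads):
    """Build an in-memory stream of length-prefixed records for the demo."""
    return io.BytesIO(b"".join(struct.pack("<i", len(p)) + p for p in payloads))


records = list(iter_records(make_stream([b"read1", b"read22"])))
```

Without the `if not header` check, the third call to `struct.unpack` would receive zero bytes and raise the same `struct.error` shown in the traceback.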

@betteridiot
Owner

fe1326c fixes the issue you were experiencing. The commit 8ba2388 on 'devel' branch explains the bugfix. Hope this helps, and I apologize if this has hindered your analysis in a significant way.

@olgabot
Contributor Author

olgabot commented Sep 14, 2018

Thank you so much! This works perfectly now.

One tip you can advertise about bamnostic: since it's pure Python, it can be pickled! I was hesitant to use bamnostic over pysam at first because pysam is faster, but since the pysam.AlignedSegment objects can't be pickled, the file can't be read using multiprocessing and thus you're stuck using a single thread.
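
A minimal sketch of why picklability matters here, assuming a hypothetical pure-Python `Read` class (this is not bamnostic's `AlignedSegment`): worker processes receive their arguments by pickling, so an object has to survive a `pickle` round trip before it can be handed to a process pool.

```python
import pickle


class Read:
    """Hypothetical stand-in for a pure-Python alignment record."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq


def gc_count(read):
    """Count G and C bases; the kind of per-read work a worker might do."""
    return sum(base in "GC" for base in read.seq)


original = Read("read1", "ACGTGC")

# A process pool serializes each argument like this before sending it
# to a worker, then rebuilds it on the other side.
restored = pickle.loads(pickle.dumps(original))
```

Because `restored` carries the same attributes as `original`, a function such as `gc_count` could be run on it in another process.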

@olgabot
Contributor Author

olgabot commented Sep 14, 2018

Would you be able to do a version bump soon? That way the PR I'm submitting to dib-lab/sourmash uses a pip version rather than a commit hash :)

@betteridiot
Owner

That was exactly one of my initial issues with pysam. As great as it is, I wanted to do some multiprocessing. Furthermore, pysam's AlignedSegment hash method used to hash only on position and something else (I forget). Generally speaking, this isn't a problem. However, if you are doing targeted enrichment, many reads will stack up at the same position and their hashes are no longer distinct.

So, part of making bamnostic pickle-able is due to my hash method--which takes into account the read name as well, ensuring distinct hashes.
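
A small sketch of the difference, under simplifying assumptions: the `Read` class and both hash functions below are hypothetical, and the position-only version is a simplification rather than pysam's actual implementation.

```python
class Read:
    """Hypothetical alignment record with just the fields needed here."""

    def __init__(self, name, ref, pos):
        self.name = name
        self.ref = ref
        self.pos = pos


def position_hash(read):
    """Hash on reference and position only (simplified illustration)."""
    return hash((read.ref, read.pos))


def name_aware_hash(read):
    """Hash that also takes the read name into account."""
    return hash((read.name, read.ref, read.pos))


# Three differently named reads stacked at the same position, as can
# happen with targeted enrichment.
reads = [Read(f"read{i}", "chr1", 1000) for i in range(3)]

position_hashes = {position_hash(r) for r in reads}
name_aware_hashes = {name_aware_hash(r) for r in reads}
```

Here `position_hashes` collapses to a single value, while `name_aware_hashes` keeps one value per read.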

Also, v0.9.0 is available on PyPI and GitHub. Conda-forge has to go through some CI before the maintainers accept it.
