-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support reading CRAM files with embedded references #307
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #307 +/- ##
==========================================
- Coverage 88.50% 88.40% -0.11%
==========================================
Files 95 95
Lines 8214 8228 +14
Branches 506 506
==========================================
+ Hits 7270 7274 +4
- Misses 438 448 +10
Partials 506 506 ☔ View full report in Codecov by Sentry. |
4795708
to
4fdc68b
Compare
db98dac
to
21bd22e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this feature! I added some comments. Please take a look 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating! LGTM 👍
01eaab0
to
f1837a8
Compare
Rebased onto the latest master branch. |
f1837a8
to
7d8671b
Compare
Rebased onto the latest master branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this update. LGTM 👍
I haven't merged it because I thought you might want to handle the commit integration.
7d8671b
to
fae8525
Compare
Thank you both for reviewing! I've squashed the commits and merged the change. |
Note: This PR will be marked as ready for review after #306 has been merged.This PR allows the CRAM reader to handle files with embedded references. With this update, we no longer need to specify any reference file to read CRAM files with embedded references.
Specification
According to the CRAM specification, if the encoder embeds a portion of the reference sequence into a slice, it allocates a dedicated block to store the raw (ASCII character) base sequence data for the reference sequence covering that slice. The content ID of this allocated block is pointed to by the "embedded reference bases block content ID" field in the slice header. If there is no embedded reference for the slice, this field's value will be
-1
. (See §8.5 for more details)Implementation
In this change, the CRAM reader first checks the "embedded reference bases block content ID" field in the slice header to identify the block allocated for the embedded reference. If this value is positive, it creates a dedicated seq resolver based on the content of the block associated with the specified content ID. This new seq resolver is responsible for resolving reference sequences covered by the embedded reference and helps with the decoding of CRAM records in the slice.
Notes
This PR also makes some slight changes to the signature of the seq resolver protocol method to make the new seq resolver for embedded reference a little bit more efficient.
Note that this PR does not include additional test code, although you can easily verify the change using a simple snippet with a test file
0600_mapped.cram
from hts-specs (which is documented as being encoded with embedded references) as follows: