-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for CRAM index generation #317
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #317 +/- ##
==========================================
+ Coverage 89.28% 89.50% +0.21%
==========================================
Files 97 98 +1
Lines 8841 9005 +164
Branches 481 481
==========================================
+ Hits 7894 8060 +166
+ Misses 466 464 -2
Partials 481 481 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this. 🙏
It looks mostly fine, but I've added a few comments on points that I think could potentially be improved.
records (object-array | ||
[{:rname "ref", :pos 1, :cigar "5M", :seq "AGAAT", :qual "HFHHH"} | ||
{:rname "ref", :pos 5, :cigar "2S3M",:seq "CCTGT", :qual "##AAC"} | ||
{:rname "ref", :pos 10, :cigar "5M", :seq "GATAA", :qual "CCCFF"} | ||
{:rname "ref", :pos 15, :cigar "1M1I1M1D2M", :seq "GAAAG", :qual "EBBFF"} | ||
{:rname "*", :pos 0, :cigar "*", :seq "CTGTG", :qual "AEEEE"}])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would be better to include data where the :seq
is "*" to improve test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test case for this in c2ea942.
(deftest make-alignment-spans-builder-test | ||
(let [builder (stats/make-alignment-spans-builder) | ||
records [{:ri 0, :start 1, :end 100} | ||
{:ri 0, :start 51, :end 150} | ||
{:ri 1, :start 51, :end 150} | ||
{:ri 1, :start 151, :end 300} | ||
{:ri -1, :start 0, :end 0} | ||
{:ri -1, :start 0, :end 0}]] | ||
(run! (fn [{:keys [ri start end]}] | ||
(stats/update-span! builder ri start end)) | ||
records) | ||
(is (= {0 {:start 1, :span 150} | ||
1 {:start 51, :span 250} | ||
-1 {:start 0, :span 1}} | ||
(stats/build-spans builder))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id. For example, in the context of this test, input like:
{:ri 2, :start 51, :end 150}
{:ri 2, :start 200, :end 300}
This would make it easier to verify the behavior of the function.
Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored? For the input above, would the expected output be 2 {:start 51, :span 250}
and is this the correct behavior according to the specification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id.
Will do. Thanks!
Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored?
I believe so. Otherwise, an index entry would have to be created for each contiguous alignment span.
The CRAM specification says:
Multi-reference slices may need to have multiple lines for the same slice; one for each reference contained within that slice.
In other words, at most one index entry can be created for each reference, which is incompatible with the "one entry for each contiguous alignment span" assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the first point, c2ea942 just added a test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the additional commits!
LGTM 👍
Fix spacing Co-authored-by: Matsutomo81 <[email protected]> Add missing test cases
c2ea942
to
2d6747b
Compare
Thank you for reviewing! I've squashed the additional two commits into the third commit. |
Feature overview
This PR adds support for generating CRAM index files during CRAM writing. Specifically, it provides the following two APIs:
:create-index?
option when creating the CRAM writerBy default, CRAM index files can only be generated for CRAM files sorted by coordinate. Trying to generate an index file for a CRAM file not sorted by coordinate will result in an error.
Whether a CRAM file is sorted by coordinate is determined by checking if the CRAM header is declared as
SO:coordinate
. To skip this check and generate an index file for CRAM files not declared asSO:coordinate
, use theskip-sort-order-check?
option:Details
A CRAM index file is a gzipped TSV file with each line containing the following fields (See the CRAM specification §12. Indexing for more details):
Typically, one index entry is created per slice. However, for multiple reference slices, an entry is created for each reference to which records are mapped.
For non-multiple reference slices, the necessary information for creating index entries can be obtained solely from the container header and slice header. For multiple reference slices, it is necessary to scan the records within the slice to calculate the span for each reference in addition to the header information.
This PR introduces changes to both the CRAM writer (for generating index files during CRAM writing) and the CRAM reader (for the CRAM indexer). The responsibility of generating index entries is added to each of them. Additionally, the facility to scan slice records and calculate spans for each reference for multiple reference slices is newly added to the alignment stats.