Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for CRAM index generation #317

Merged
merged 3 commits into from
Aug 2, 2024
Merged

Conversation

athos
Copy link
Member

@athos athos commented Jul 24, 2024

Feature overview

This PR adds support for generating CRAM index files during CRAM writing. Specifically, it provides the following two APIs:

  • Automatic index file generation during CRAM writing
    • By default, index file generation is disabled
    • To enable index file generation, use the :create-index? option when creating the CRAM writer
    (require '[cljam.io.cram :as cram])
    
    (with-open [w (cram/writer "path/to/cram/file" {:create-index? true})]
      ... )
  • CRAM Indexer, which generates a CRAM index file for an existing CRAM file
    (require '[cljam.algo.cram-indexer :as indexer])
    
    (indexer/create-index "path/to/cram/file" "path/to/crai/file")

By default, CRAM index files can only be generated for CRAM files sorted by coordinate. Trying to generate an index file for a CRAM file not sorted by coordinate will result in an error.

Whether a CRAM file is sorted by coordinate is determined by checking if the CRAM header is declared as SO:coordinate. To skip this check and generate an index file for CRAM files not declared as SO:coordinate, use the skip-sort-order-check? option:

(require '[cljam.io.cram :as cram])

(with-open [w (cram/writer "path/to/cram/file" {:create-index? true
                                                :skip-sort-order-check? true})]
  ... )
(require '[cljam.algo.cram-indexer :as indexer])

(indexer/create-index "path/to/cram/file" "path/to/crai/file" :skip-sort-order-check? true)

Details

A CRAM index file is a gzipped TSV file with each line containing the following fields (See the CRAM specification §12. Indexing for more details):

  • Reference sequence id (-1 for slices containing only unmapped records)
  • Alignment start
  • Alignment span
  • Absolute byte offset of the container header in the file
  • Relative byte offset of the slice header block
  • Slice size in bytes

Typically, one index entry is created per slice. However, for multiple reference slices, an entry is created for each reference to which records are mapped.

For non-multiple reference slices, the necessary information for creating index entries can be obtained solely from the container header and slice header. For multiple reference slices, it is necessary to scan the records within the slice to calculate the span for each reference in addition to the header information.

This PR introduces changes to both the CRAM writer (for generating index files during CRAM writing) and the CRAM reader (for the CRAM indexer). The responsibility of generating index entries is added to each of them. Additionally, the facility to scan slice records and calculate spans for each reference for multiple reference slices is newly added to the alignment stats.

@athos athos requested a review from a team July 24, 2024 02:30
@athos athos self-assigned this Jul 24, 2024
@athos athos requested review from matsutomo81 and removed request for a team July 24, 2024 02:30
@athos athos requested a review from alumi as a code owner July 24, 2024 02:30
@athos athos removed the request for review from alumi July 24, 2024 02:31
Copy link

codecov bot commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 98.12734% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.50%. Comparing base (121e544) to head (2d6747b).

Files Patch % Lines
src/cljam/io/cram/writer.clj 92.85% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
+ Coverage   89.28%   89.50%   +0.21%     
==========================================
  Files          97       98       +1     
  Lines        8841     9005     +164     
  Branches      481      481              
==========================================
+ Hits         7894     8060     +166     
+ Misses        466      464       -2     
  Partials      481      481              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@matsutomo81 matsutomo81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work on this. 🙏
It looks mostly fine, but I've added a few comments on points that I think could potentially be improved.

test/cljam/io/cram/encode/record_test.clj Outdated Show resolved Hide resolved
Comment on lines 73 to 78
records (object-array
[{:rname "ref", :pos 1, :cigar "5M", :seq "AGAAT", :qual "HFHHH"}
{:rname "ref", :pos 5, :cigar "2S3M",:seq "CCTGT", :qual "##AAC"}
{:rname "ref", :pos 10, :cigar "5M", :seq "GATAA", :qual "CCCFF"}
{:rname "ref", :pos 15, :cigar "1M1I1M1D2M", :seq "GAAAG", :qual "EBBFF"}
{:rname "*", :pos 0, :cigar "*", :seq "CTGTG", :qual "AEEEE"}])]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be better to include data where the :seq is "*" to improve test coverage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case for this in c2ea942.

Comment on lines 53 to 70
(deftest make-alignment-spans-builder-test
(let [builder (stats/make-alignment-spans-builder)
records [{:ri 0, :start 1, :end 100}
{:ri 0, :start 51, :end 150}
{:ri 1, :start 51, :end 150}
{:ri 1, :start 151, :end 300}
{:ri -1, :start 0, :end 0}
{:ri -1, :start 0, :end 0}]]
(run! (fn [{:keys [ri start end]}]
(stats/update-span! builder ri start end))
records)
(is (= {0 {:start 1, :span 150}
1 {:start 51, :span 250}
-1 {:start 0, :span 1}}
(stats/build-spans builder)))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id. For example, in the context of this test, input like:

{:ri 2, :start 51, :end 150}
{:ri 2, :start 200, :end 300}

This would make it easier to verify the behavior of the function.

Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored? For the input above, would the expected output be 2 {:start 51, :span 250} and is this the correct behavior according to the specification?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id.

Will do. Thanks!

Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored?

I believe so. Otherwise, an index entry would have to be created for each contiguous alignment span.
The CRAM specification says:

Multi-reference slices may need to have multiple lines for the same slice; one for each reference contained within that slice.

In other words, at most one index entry can be created for each reference, which is incompatible with the "one entry for each contiguous alignment span" assumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the first point, c2ea942 just added a test case.

@athos athos requested a review from a team as a code owner August 2, 2024 03:04
@athos athos requested review from r6eve and removed request for a team August 2, 2024 03:04
Copy link
Contributor

@matsutomo81 matsutomo81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the additional commits!
LGTM 👍

@athos athos removed the request for review from r6eve August 2, 2024 04:58
Fix spacing

Co-authored-by: Matsutomo81 <[email protected]>

Add missing test cases
@athos athos force-pushed the feature/cram-index-generation branch from c2ea942 to 2d6747b Compare August 2, 2024 05:06
@athos
Copy link
Member Author

athos commented Aug 2, 2024

Thank you for reviewing! I've squashed the additional two commits into the third commit.

@matsutomo81 matsutomo81 merged commit 427356d into master Aug 2, 2024
17 checks passed
@matsutomo81 matsutomo81 deleted the feature/cram-index-generation branch August 2, 2024 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants