Support for CRAM index generation #317

athos · 2024-07-24T02:30:47Z

Feature overview

This PR adds support for generating CRAM index files during CRAM writing. Specifically, it provides the following two APIs:

Automatic index file generation during CRAM writing
- By default, index file generation is disabled
- To enable index file generation, use the :create-index? option when creating the CRAM writer
```
(require '[cljam.io.cram :as cram])

(with-open [w (cram/writer "path/to/cram/file" {:create-index? true})]
  ... )
```

CRAM Indexer, which generates a CRAM index file for an existing CRAM file

```
(require '[cljam.algo.cram-indexer :as indexer])

(indexer/create-index "path/to/cram/file" "path/to/crai/file")
```

By default, CRAM index files can only be generated for CRAM files sorted by coordinate. Trying to generate an index file for a CRAM file not sorted by coordinate will result in an error.

Whether a CRAM file is sorted by coordinate is determined by checking whether the CRAM header declares SO:coordinate. To skip this check and generate an index file for a CRAM file not declared as SO:coordinate, use the :skip-sort-order-check? option:

```
(require '[cljam.io.cram :as cram])

(with-open [w (cram/writer "path/to/cram/file" {:create-index? true
                                                :skip-sort-order-check? true})]
  ... )
```

```
(require '[cljam.algo.cram-indexer :as indexer])

(indexer/create-index "path/to/cram/file" "path/to/crai/file" :skip-sort-order-check? true)
```
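The sort-order check described above could be pictured roughly as follows. This is only a sketch: the header shape (`{:HD {:SO "coordinate"}}`) and the function names `coordinate-sorted?` and `check-sort-order!` are assumptions made for illustration, not necessarily what the writer or indexer use internally.

```clojure
;; Assumption: the header is a map shaped like {:HD {:SO "coordinate"} ...}.
(defn coordinate-sorted?
  "True if the header declares SO:coordinate."
  [header]
  (= "coordinate" (get-in header [:HD :SO])))

;; Hypothetical helper: throws unless the header is declared as sorted by
;; coordinate or the caller explicitly opts out of the check.
(defn check-sort-order!
  [header {:keys [skip-sort-order-check?]}]
  (when-not (or skip-sort-order-check? (coordinate-sorted? header))
    (throw (ex-info "CRAM file is not declared as sorted by coordinate"
                    {:sort-order (get-in header [:HD :SO])}))))
```

Note that the check only looks at what the header declares; it does not verify that the records themselves are in coordinate order.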

Details

A CRAM index file is a gzipped TSV file with each line containing the following fields (see the CRAM specification §12. Indexing for more details):

- Reference sequence id (-1 for slices containing only unmapped records)
- Alignment start
- Alignment span
- Absolute byte offset of the container header in the file
- Relative byte offset of the slice header block
- Slice size in bytes
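To make the file layout concrete, here is a minimal sketch that writes and reads back such a gzipped TSV file. The entry keys (`:ref-seq-id`, `:container-offset`, and so on), the function names, and the numbers are hypothetical and chosen only for illustration; they are not the names used inside cljam.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '(java.util.zip GZIPInputStream GZIPOutputStream))

;; Hypothetical entries; keys and values are made up for this example.
(def example-entries
  [{:ref-seq-id 0, :start 1, :span 150,
    :container-offset 1000, :slice-offset 200, :size 5000}
   {:ref-seq-id 1, :start 51, :span 250,
    :container-offset 6200, :slice-offset 180, :size 900}])

(defn entry->line
  "Formats one index entry as a tab-separated line, fields in the order listed above."
  [{:keys [ref-seq-id start span container-offset slice-offset size]}]
  (str/join "\t" [ref-seq-id start span container-offset slice-offset size]))

(defn write-crai!
  "Writes the entries to path as gzipped TSV, one entry per line."
  [path entries]
  (with-open [w (io/writer (GZIPOutputStream. (io/output-stream path)))]
    (doseq [e entries]
      (.write w (str (entry->line e) "\n")))))

(defn read-crai
  "Reads a gzipped TSV index back into a vector of vectors of longs."
  [path]
  (with-open [r (io/reader (GZIPInputStream. (io/input-stream path)))]
    (mapv (fn [line] (mapv #(Long/parseLong %) (str/split line #"\t")))
          (line-seq r))))
```

With these definitions, `(entry->line (first example-entries))` yields `"0\t1\t150\t1000\t200\t5000"`.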

Typically, one index entry is created per slice. For a multi-reference slice, however, one entry is created for each reference to which records in the slice are mapped.

For single-reference slices, the information needed to create index entries can be obtained from the container header and slice header alone. For multi-reference slices, the records within the slice must also be scanned to calculate the span for each reference.
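The per-reference span calculation can be sketched as a pure function over the records of a slice. This is an illustrative simplification, not the builder this PR adds to the alignment stats; the function name `alignment-spans` and the record shape (`:ri`, `:start`, `:end`, with 1-based inclusive coordinates) are assumptions that mirror the test data in this PR.

```clojure
;; Hypothetical record shape: :ri is the reference index, :start and :end are
;; 1-based inclusive alignment coordinates.
(defn alignment-spans
  "Returns a map of reference index -> {:start, :span} covering every record
  mapped to that reference. Gaps between records are not split out."
  [records]
  (let [extents (reduce (fn [acc {:keys [ri start end]}]
                          (update acc ri
                                  (fn [extent]
                                    (if extent
                                      {:start (min (:start extent) start)
                                       :end (max (:end extent) end)}
                                      {:start start, :end end}))))
                        {}
                        records)]
    (into {}
          (map (fn [[ri {:keys [start end]}]]
                 [ri {:start start, :span (inc (- end start))}]))
          extents)))

(alignment-spans [{:ri 0, :start 1, :end 100}
                  {:ri 0, :start 51, :end 150}
                  {:ri 1, :start 51, :end 150}
                  {:ri 1, :start 151, :end 300}])
;; => {0 {:start 1, :span 150}, 1 {:start 51, :span 250}}
```

Each reference yields exactly one `{:start, :span}` pair, which corresponds to one index entry per reference for the slice.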

This PR changes both the CRAM writer (to generate index files during CRAM writing) and the CRAM reader (to support the CRAM indexer), and each of them becomes responsible for generating index entries. Additionally, the alignment stats gain a new facility that scans slice records and calculates the span for each reference in a multi-reference slice.

codecov · 2024-07-24T02:34:44Z

Codecov Report

Attention: Patch coverage is 98.12734% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.50%. Comparing base (121e544) to head (2d6747b).

| Files | Patch % | Lines |
|---|---|---|
| src/cljam/io/cram/writer.clj | 92.85% | 4 Missing and 1 partial ⚠️ |

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
+ Coverage   89.28%   89.50%   +0.21%
==========================================
  Files          97       98       +1
  Lines        8841     9005     +164
  Branches      481      481
==========================================
+ Hits         7894     8060     +166
+ Misses        466      464       -2
  Partials      481      481
```

matsutomo81

Thank you for your work on this. 🙏
It looks mostly fine, but I've added a few comments on points that I think could potentially be improved.

test/cljam/io/cram/encode/record_test.clj

matsutomo81 · 2024-08-01T12:38:35Z

```
+        records (object-array
+                 [{:rname "ref", :pos 1, :cigar "5M", :seq "AGAAT", :qual "HFHHH"}
+                  {:rname "ref", :pos 5, :cigar "2S3M",:seq "CCTGT", :qual "##AAC"}
+                  {:rname "ref", :pos 10, :cigar "5M", :seq "GATAA", :qual "CCCFF"}
+                  {:rname "ref", :pos 15, :cigar "1M1I1M1D2M", :seq "GAAAG", :qual "EBBFF"}
+                  {:rname "*", :pos 0, :cigar "*", :seq "CTGTG", :qual "AEEEE"}])]
```


I thought it would be better to include data where the :seq is "*" to improve test coverage.

athos · 2024-08-02

Added a test case for this in c2ea942.

matsutomo81 · 2024-08-01T12:57:32Z

test/cljam/io/cram/encode/alignment_stats_test.clj

```
+(deftest make-alignment-spans-builder-test
+  (let [builder (stats/make-alignment-spans-builder)
+        records [{:ri 0, :start 1, :end 100}
+                 {:ri 0, :start 51, :end 150}
+                 {:ri 1, :start 51, :end 150}
+                 {:ri 1, :start 151, :end 300}
+                 {:ri -1, :start 0, :end 0}
+                 {:ri -1, :start 0, :end 0}]]
+    (run! (fn [{:keys [ri start end]}]
+            (stats/update-span! builder ri start end))
+          records)
+    (is (= {0 {:start 1, :span 150}
+            1 {:start 51, :span 250}
+            -1 {:start 0, :span 1}}
+           (stats/build-spans builder)))))
```


I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id. For example, in the context of this test, input like:

```
{:ri 2, :start 51, :end 150}
{:ri 2, :start 200, :end 300}
```

This would make it easier to verify the behavior of the function.

Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored? For the input above, would the expected output be 2 {:start 51, :span 250}, and is this the correct behavior according to the specification?

athos · 2024-08-02

> I think it would be beneficial to include test data where there's a gap in the alignment range for the same reference sequence id.

Will do. Thanks!

> Additionally, I'm not confident about my understanding of the specification, but is it correct that gaps are ignored?

I believe so. Otherwise, an index entry would have to be created for each contiguous alignment span.
The CRAM specification says:

> Multi-reference slices may need to have multiple lines for the same slice; one for each reference contained within that slice.

In other words, at most one index entry can be created for each reference, which is incompatible with the "one entry for each contiguous alignment span" assumption.

athos · 2024-08-02

As for the first point, c2ea942 just added a test case.

matsutomo81

Thank you for the additional commits!
LGTM 👍

athos · 2024-08-02T05:08:17Z

Thank you for reviewing! I've squashed the additional two commits into the third commit.

athos added 2 commits July 22, 2024 18:20

Create CRAM index file in course of writing CRAM file

3622c21

Implement CRAM indexer

2c7d255

athos requested a review from a team July 24, 2024 02:30

athos self-assigned this Jul 24, 2024

athos requested review from matsutomo81 and removed request for a team July 24, 2024 02:30

athos requested a review from alumi as a code owner July 24, 2024 02:30

athos removed the request for review from alumi July 24, 2024 02:31

athos assigned matsutomo81 Jul 24, 2024

matsutomo81 reviewed Aug 1, 2024

athos requested a review from a team as a code owner August 2, 2024 03:04

athos requested review from r6eve and removed request for a team August 2, 2024 03:04

matsutomo81 approved these changes Aug 2, 2024

athos removed the request for review from r6eve August 2, 2024 04:58

Add tests for CRAM index generation

2d6747b

Fix spacing Co-authored-by: Matsutomo81 <[email protected]> Add missing test cases

athos force-pushed the feature/cram-index-generation branch from c2ea942 to 2d6747b Compare August 2, 2024 05:06

matsutomo81 merged commit 427356d into master Aug 2, 2024
17 checks passed

matsutomo81 deleted the feature/cram-index-generation branch August 2, 2024 06:11

athos mentioned this pull request Oct 10, 2024

Support reference embedding for CRAM writer #324

Merged
