Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for random reading of vcf file. #180

Merged
merged 5 commits into from
Nov 13, 2019

Conversation

niyarin
Copy link
Contributor

@niyarin niyarin commented Oct 15, 2019

Summery

This PR adds support for random reading of bgzip compressed VCF files.
The tabix format was adjusted to bam index, and its reading function was integrated with bam-index to create io/util/bin.clj.

@niyarin niyarin requested a review from alumi as a code owner October 15, 2019 23:10
@niyarin niyarin requested a review from a team October 15, 2019 23:10
@ghost ghost requested review from athos and removed request for a team October 15, 2019 23:10
@codecov
Copy link

codecov bot commented Oct 15, 2019

Codecov Report

Merging #180 into master will decrease coverage by 0.44%.
The diff coverage is 93.1%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #180      +/-   ##
==========================================
- Coverage   86.92%   86.47%   -0.45%     
==========================================
  Files          74       75       +1     
  Lines        5765     5924     +159     
  Branches      490      497       +7     
==========================================
+ Hits         5011     5123     +112     
- Misses        264      304      +40     
- Partials      490      497       +7
Impacted Files Coverage Δ
src/cljam/io/bam_index/core.clj 78.94% <100%> (-4.69%) ⬇️
src/cljam/io/vcf.clj 95.74% <100%> (+0.87%) ⬆️
src/cljam/io/bam_index/reader.clj 97.7% <100%> (+0.02%) ⬆️
src/cljam/io/bam_index.clj 100% <100%> (ø) ⬆️
src/cljam/io/tabix.clj 100% <100%> (ø) ⬆️
src/cljam/io/bam_index/writer.clj 84.29% <66.66%> (-0.09%) ⬇️
src/cljam/io/vcf/reader.clj 85.71% <81.48%> (-10.96%) ⬇️
src/cljam/io/util/chunk.clj 75.6% <85.71%> (ø)
src/cljam/io/util/bin.clj 95.83% <95.83%> (ø)
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e94da96...a6d6692. Read the comment docs.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR! 👍
Could you realign your codes and remove trailing whitespaces?
git diff master --name-only | grep '\.clj$' | xargs lein update-in :plugins conj '[lein-cljfmt "0.6.4"]' -- cljfmt fix will help you fix that kind of problem in the patch.
For more details, please refer to https://guide.clojure.style.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments found by a linter eastwood.

Try something like this:

git diff master --name-only |
  grep '\.clj$' |
  sed -E 's/\//./g;s/_/-/g;s/\.clj$//g;s/^(src|test)\.//' |
  tr '\n' ' ' |
  xargs -I{} lein update-in :plugins conj '[jonase/eastwood "0.3.5"]' -- update-in :eastwood assoc :add-linters '[:unused-namespaces :unused-locals :unused-fn-args]' -- eastwood "{:namespaces [{}]}" 2>/dev/null |
  grep -v 'Reflection warning'

It will find out potential bugs in this patch. (Beware of false positives!)
You can omit lein update-in if you have these settings in ~/.lein/profiles.clj.

src/cljam/io/bam_index/core.clj Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
test/cljam/io/util/bin_test.clj Outdated Show resolved Hide resolved
@niyarin niyarin force-pushed the feature/vcf-reader-with-index branch from ac96557 to abb6a71 Compare October 16, 2019 07:13
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I've added some more comments.

src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
test/cljam/io/vcf_test.clj Outdated Show resolved Hide resolved
test/cljam/test_common.clj Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @niyarin. Thank you for working on this feature 👍

I added some comments.
As a general suggestion: Don't type hint too defensively. Type hints are just for suppressing reflective invocation. If you add more type hints than necessary, the code would look somewhat clumsy, so keep them as few as possible. The Clojure compiler knows much where a type hint is necessary. lein check warns you if a required type hint is missing.

(lidx-ref [this]))

(defn get-spans
[^cljam.io.util.bin.IBinaryIndex index-data ^long ref-idx ^long beg ^long end]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IBinaryIndex is a protocol, which is a Clojure construct, so you don't have to type hint index-data.

(let [bins (reg->bins beg end)
bidx (get (bidx-ref index-data) ref-idx)
lidx (get (lidx-ref index-data) ref-idx)
chunks (into [] (comp (map bidx) cat) bins)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(comp (map bidx) cat) can be replaced with (mapcat bidx), though it's not part of the code you wrote 😅

(deftest about-read-index-returns-a-map
(is (map? (tbi/read-index test-tabix-file))))
(deftest about-read-index-returns-tabix-object
(is (instance? cljam.io.tabix.Tabix (tbi/read-index test-tabix-file))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you do (:import [cljam.io.tabix Tabix]) in the ns declaration, you don't have to spell out the fully qualified name, just refer to it as Tabix.

Comment on lines 26 to 34
(is (instance?
Chunk
(get
^clojure.lang.IPersistentVector
(get
^clojure.lang.IPersistentMap
(get
^clojure.lang.IPersistentMap
(.bidx tabix-data) 0) 4687) 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write these lines simply:

(is (instance? Chunk (get-in (.bidx tabix-data) [0 4687 0])))

^clojure.lang.IPersistentMap
(.bidx tabix-data) 0) 4687) 0)))
(is (vector?
^clojure.lang.IPersistentVector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type hint can be omitted.

Comment on lines 45 to 46
(are [x] ((partial instance? cljam.io.tabix.Tabix)
(tbi/read-index x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be (instance? Tabix (tbi/read-index x)).

@@ -0,0 +1,45 @@
(ns cljam.io.util.bin
(:refer-clojure :exclude [compare])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to exclude compare?

[^DataInputStream rdr]
{:beg (lsb/read-long rdr)
:end (lsb/read-long rdr)})
(defn- read-bin-index**!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get what you meant by **! characters here 🤔
I'd rather name it read-bin-index instead.

n-chunk (lsb/read-int rdr)]
{:bin bin
:chunks (doall (map (fn [_] (read-chunk rdr)) (range n-chunk)))}))
(defn- read-linear-index**!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as read-bin-index.

@niyarin
Copy link
Contributor Author

niyarin commented Oct 25, 2019

Thank you for pointing.
I fixed them.

@alumi
Copy link
Member

alumi commented Oct 28, 2019

@niyarin Thanks for the update! 👍It seems like CI is failing. Could you take a look around here?

[cljam.io.tabix :as tabix]

@niyarin niyarin force-pushed the feature/vcf-reader-with-index branch from c6d4602 to a938984 Compare October 28, 2019 03:44
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my late reply. The code looks much better! I've added some more comments.

src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
test/cljam/io/util/bin_test.clj Outdated Show resolved Hide resolved
test/cljam/io/vcf_test.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
@niyarin
Copy link
Contributor Author

niyarin commented Nov 11, 2019

Thank you for pointing.
I fixed them.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'm sorry for bothering you many times but I've added some trivial comments.

src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
@niyarin
Copy link
Contributor Author

niyarin commented Nov 11, 2019

Thank you for pointing.
I fixed them.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@alumi alumi requested a review from athos November 12, 2019 01:16
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a few more minor comments.

@@ -178,3 +180,41 @@
:deep (vcf-util/variant-parser (.meta-info rdr) (.header rdr))
:vcf identity)]
(map parse-fn (read-data-lines (.reader rdr) (.header rdr) kws)))))

(defn- make-lazy-variants [f s]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make-lazy-variants looks roughly equivalent to mapcat. Could we replace this with it? Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think name binding is necessary for recursion of lazy sequences.
Is there a better expression?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for putting the comment to a confusing place 🙏

I meant I asked whether we could replace the invocation to make-lazy-variants (that appears at L205) with mapcat since they apparently behave in the same way:

(make-lazy-variants (juxt dec inc) [1 2 3]) ;=> (0 2 1 3 2 4)
(mapcat (juxt dec inc) [1 2 3]) ;=> (0 2 1 3 2 4)

Or do they actually have a difference from some kind of perspective?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I understand.

(make-lazy-variants f (rest s)))))

(defn read-variants-randomly
"Read variants of the bgzip compressed VCF file randomly using tabix file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Read variants of the bgzip compressed VCF file randomly using tabix file.
"Reads variants of the bgzip compressed VCF file randomly using tabix file.

(bit-shift-right (if (<= pos 0) 0 (dec pos)) linear-index-shift))

(defn get-spans
"Calculate span information for random access from ndex data such as tabix."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Calculate span information for random access from ndex data such as tabix."
"Calculate span information for random access from index data such as tabix."

(is (number? (.meta tabix-data)))
(is (number? (.skip tabix-data)))
(is (vector? (.seq tabix-data)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))

(is (number? (.skip tabix-data)))
(is (vector? (.seq tabix-data)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
(is (vector? (get (.lidx tabix-data) 0)))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(is (vector? (get (.lidx tabix-data) 0)))))
(is (vector? (get (.lidx tabix-data) 0)))))

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 I will merge it.
And thanks again for patiently working on this feature! 💪

@athos athos merged commit 5d9ea2d into master Nov 13, 2019
@athos athos deleted the feature/vcf-reader-with-index branch November 13, 2019 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants