Add support for random reading of vcf file. #180
Codecov Report
@@ Coverage Diff @@
## master #180 +/- ##
==========================================
- Coverage 86.92% 86.47% -0.45%
==========================================
Files 74 75 +1
Lines 5765 5924 +159
Branches 490 497 +7
==========================================
+ Hits 5011 5123 +112
- Misses 264 304 +40
- Partials 490 497 +7
Thanks for your PR! 👍
Could you realign your code and remove trailing whitespace?
git diff master --name-only | grep '\.clj$' | xargs lein update-in :plugins conj '[lein-cljfmt "0.6.4"]' -- cljfmt fix
will help you fix that kind of problem in the patch.
For more details, please refer to https://guide.clojure.style.
Added some comments on issues found by the linter eastwood.
Try something like this:
git diff master --name-only |
grep '\.clj$' |
sed -E 's/\//./g;s/_/-/g;s/\.clj$//g;s/^(src|test)\.//' |
tr '\n' ' ' |
xargs -I{} lein update-in :plugins conj '[jonase/eastwood "0.3.5"]' -- update-in :eastwood assoc :add-linters '[:unused-namespaces :unused-locals :unused-fn-args]' -- eastwood "{:namespaces [{}]}" 2>/dev/null |
grep -v 'Reflection warning'
It will point out potential bugs in this patch. (Beware of false positives!)
You can omit lein update-in if you have these settings in ~/.lein/profiles.clj.
ac96557 to abb6a71
Thanks for the update! I've added some more comments.
Hi, @niyarin. Thank you for working on this feature 👍
I added some comments.
As a general suggestion: don't type hint too defensively. Type hints are only for suppressing reflective invocation. If you add more type hints than necessary, the code looks clumsy, so keep them to a minimum. The Clojure compiler knows quite well where a type hint is necessary, and lein check warns you if a required type hint is missing.
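To illustrate the point about type hints, here is a minimal sketch. The function names are hypothetical and are not part of this PR; the only assumption is a plain Clojure REPL or script with reflection warnings turned on.

```clojure
;; Hypothetical example, not code from this PR.
(set! *warn-on-reflection* true)

;; No hint: the compiler cannot resolve .length at compile time, so it
;; prints a reflection warning and the call goes through reflection.
(defn len-reflective [s]
  (.length s))

;; A single hint on the argument is enough to resolve the call.
(defn len-hinted [^String s]
  (.length s))

;; No extra hint is needed on the inner call: the compiler infers that
;; .toUpperCase returns a String from the method signature.
(defn upper-len [^String s]
  (.length (.toUpperCase s)))
```

All three functions return the same value for a string argument; only the first triggers a warning, which is the kind of warning lein check reports.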
src/cljam/io/util/bin.clj
Outdated
(lidx-ref [this]))

(defn get-spans
  [^cljam.io.util.bin.IBinaryIndex index-data ^long ref-idx ^long beg ^long end]
IBinaryIndex is a protocol, which is a Clojure construct, so you don't have to type hint index-data.
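A minimal sketch of why the hint is unnecessary: protocol methods are ordinary Clojure functions that dispatch on their first argument, so calling them involves no Java interop to resolve. The protocol method names below follow the snippet under review; ToyIndex and bins-for-ref are hypothetical names made up for this illustration.

```clojure
;; Hypothetical stand-in for the protocol in src/cljam/io/util/bin.clj.
(defprotocol IBinaryIndex
  (bidx-ref [this])
  (lidx-ref [this]))

;; Toy implementation, for illustration only.
(defrecord ToyIndex [bidx lidx]
  IBinaryIndex
  (bidx-ref [_] bidx)
  (lidx-ref [_] lidx))

;; No type hint on `index`: bidx-ref is a plain function call.
(defn bins-for-ref [index ref-idx]
  (get (bidx-ref index) ref-idx))

(bins-for-ref (->ToyIndex {0 {4687 [:chunk]}} {0 [0 100]}) 0)
;=> {4687 [:chunk]}
```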
src/cljam/io/util/bin.clj
Outdated
(let [bins (reg->bins beg end)
      bidx (get (bidx-ref index-data) ref-idx)
      lidx (get (lidx-ref index-data) ref-idx)
      chunks (into [] (comp (map bidx) cat) bins)
(comp (map bidx) cat) can be replaced with (mapcat bidx), though it's not part of the code you wrote 😅
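A quick check of that equivalence, using a toy map as a hypothetical stand-in for the real binning index (the data below is made up):

```clojure
;; Hypothetical data: bin number -> vector of chunks.
(def bidx {0 [:a :b], 1 [:c], 9 [:d :e]})
(def bins [0 1 2 9]) ; bin 2 has no entry, so (bidx 2) is nil

;; The form used in the patch.
(def chunks-comp (into [] (comp (map bidx) cat) bins))

;; The suggested form: the one-argument mapcat returns a transducer.
(def chunks-mapcat (into [] (mapcat bidx) bins))

(= chunks-comp chunks-mapcat [:a :b :c :d :e]) ;=> true
```

Bins without an entry yield nil, which both forms simply skip.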
test/cljam/io/tabix_test.clj
Outdated
(deftest about-read-index-returns-a-map
  (is (map? (tbi/read-index test-tabix-file))))
(deftest about-read-index-returns-tabix-object
  (is (instance? cljam.io.tabix.Tabix (tbi/read-index test-tabix-file))))
Once you do (:import [cljam.io.tabix Tabix]) in the ns declaration, you don't have to spell out the fully qualified name; just refer to it as Tabix.
test/cljam/io/tabix_test.clj
Outdated
(is (instance?
     Chunk
     (get
      ^clojure.lang.IPersistentVector
      (get
       ^clojure.lang.IPersistentMap
       (get
        ^clojure.lang.IPersistentMap
        (.bidx tabix-data) 0) 4687) 0)))
You can write these lines simply:
(is (instance? Chunk (get-in (.bidx tabix-data) [0 4687 0])))
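As a small illustration of why the two forms agree, here is a sketch with made-up nested data standing in for the value of (.bidx tabix-data):

```clojure
;; Hypothetical data: ref index -> bin -> vector of chunks.
(def bidx {0 {4687 [:chunk-a :chunk-b]}})

;; Nested get calls, as in the original test.
(get (get (get bidx 0) 4687) 0) ;=> :chunk-a

;; get-in walks the same path; integer keys index into vectors too.
(get-in bidx [0 4687 0])        ;=> :chunk-a

;; A missing key anywhere along the path yields nil in both forms.
(get-in bidx [1 4687 0])        ;=> nil
```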
test/cljam/io/tabix_test.clj
Outdated
^clojure.lang.IPersistentMap
(.bidx tabix-data) 0) 4687) 0)))
(is (vector?
     ^clojure.lang.IPersistentVector
This type hint can be omitted.
test/cljam/io/tabix_test.clj
Outdated
(are [x] ((partial instance? cljam.io.tabix.Tabix)
          (tbi/read-index x))
This could be (instance? Tabix (tbi/read-index x)).
src/cljam/io/util/bin.clj
Outdated
@@ -0,0 +1,45 @@
(ns cljam.io.util.bin
  (:refer-clojure :exclude [compare])
No need to exclude compare?
src/cljam/io/tabix.clj
Outdated
[^DataInputStream rdr]
{:beg (lsb/read-long rdr)
 :end (lsb/read-long rdr)})
(defn- read-bin-index**!
I don't get what you meant by the **! characters here 🤔
I'd rather name it read-bin-index instead.
src/cljam/io/tabix.clj
Outdated
n-chunk (lsb/read-int rdr)]
{:bin bin
 :chunks (doall (map (fn [_] (read-chunk rdr)) (range n-chunk)))}))
(defn- read-linear-index**!
Same as read-bin-index.
Thank you for pointing that out.
@niyarin Thanks for the update! 👍 It seems like CI is failing. Could you take a look around here?
Line 7 in c6d4602
c6d4602 to a938984
Sorry for my late reply. The code looks much better! I've added some more comments.
Thank you for pointing that out.
Thank you! I'm sorry for bothering you many times but I've added some trivial comments.
Thank you for pointing that out.
LGTM 👍
Just added a few more minor comments.
src/cljam/io/vcf/reader.clj
Outdated
@@ -178,3 +180,41 @@
:deep (vcf-util/variant-parser (.meta-info rdr) (.header rdr))
:vcf identity)]
(map parse-fn (read-data-lines (.reader rdr) (.header rdr) kws)))))

(defn- make-lazy-variants [f s]
make-lazy-variants looks roughly equivalent to mapcat. Could we replace this with it? Am I missing something?
I think a name binding is necessary to recurse over lazy sequences.
Is there a better way to express this?
Sorry for putting the comment in a confusing place 🙏
I meant to ask whether we could replace the invocation of make-lazy-variants (which appears at L205) with mapcat, since they apparently behave the same way:
(make-lazy-variants (juxt dec inc) [1 2 3]) ;=> (0 2 1 3 2 4)
(mapcat (juxt dec inc) [1 2 3]) ;=> (0 2 1 3 2 4)
Or do they actually differ in some respect?
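For readers following along, the comparison can be sketched as below. The full body of make-lazy-variants is not shown in this thread, so make-lazy-variants-sketch is a hypothetical reconstruction based only on the visible recursive call; the actual function in the patch may differ.

```clojure
;; Hypothetical reconstruction: a hand-rolled lazy concatenation of
;; (f x) over each x in s. Not the code from the patch.
(defn make-lazy-variants-sketch [f s]
  (when (seq s)
    (lazy-cat (f (first s))
              (make-lazy-variants-sketch f (rest s)))))

(make-lazy-variants-sketch (juxt dec inc) [1 2 3]) ;=> (0 2 1 3 2 4)
(mapcat (juxt dec inc) [1 2 3])                    ;=> (0 2 1 3 2 4)

;; One small difference in this sketch: on empty input it returns nil,
;; whereas mapcat returns an empty sequence.
(make-lazy-variants-sketch inc []) ;=> nil
(mapcat inc [])                    ;=> ()
```

The two can also differ in how eagerly they realize their input, since mapcat is built on apply and concat and may evaluate a few elements ahead. Whether that matters depends on how expensive each call to f is.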
Thank you. I understand.
src/cljam/io/vcf/reader.clj
Outdated
(make-lazy-variants f (rest s)))))

(defn read-variants-randomly
  "Read variants of the bgzip compressed VCF file randomly using tabix file.
"Read variants of the bgzip compressed VCF file randomly using tabix file.
"Reads variants of the bgzip compressed VCF file randomly using tabix file.
src/cljam/io/util/bin.clj
Outdated
(bit-shift-right (if (<= pos 0) 0 (dec pos)) linear-index-shift))

(defn get-spans
  "Calculate span information for random access from ndex data such as tabix."
"Calculate span information for random access from ndex data such as tabix."
"Calculate span information for random access from index data such as tabix."
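As background for this hunk, the bit-shift expression in its first context line maps a 1-based position to a linear-index window. A minimal sketch follows, assuming a shift of 14 (16 kbp windows, the usual convention for tabix and BAM indexes). Both the constant's value and the function name pos->lidx-offset are assumptions, since the thread shows neither.

```clojure
;; Assumed value: 2^14 = 16384 bases per linear-index window.
(def linear-index-shift 14)

;; Hypothetical name; the body mirrors the expression shown in the diff.
(defn pos->lidx-offset [pos]
  (bit-shift-right (if (<= pos 0) 0 (dec pos)) linear-index-shift))

(pos->lidx-offset 1)     ;=> 0  first base of the first window
(pos->lidx-offset 16384) ;=> 0  last base of the first window
(pos->lidx-offset 16385) ;=> 1  first base of the second window
(pos->lidx-offset 0)     ;=> 0  non-positive positions clamp to window 0
```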
test/cljam/io/tabix_test.clj
Outdated
(is (number? (.meta tabix-data)))
(is (number? (.skip tabix-data)))
(is (vector? (.seq tabix-data)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
test/cljam/io/tabix_test.clj
Outdated
(is (number? (.skip tabix-data)))
(is (vector? (.seq tabix-data)))
(is (instance? Chunk (get (get (get (.bidx tabix-data) 0) 4687) 0)))
(is (vector? (get (.lidx tabix-data) 0)))))
(is (vector? (get (.lidx tabix-data) 0)))))
(is (vector? (get (.lidx tabix-data) 0)))))
LGTM 👍 I will merge it.
And thanks again for patiently working on this feature! 💪
Summary
This PR adds support for random reading of bgzip compressed VCF files.
The tabix index format was aligned with the BAM index, and its reading function was integrated with the bam-index one to create io/util/bin.clj.