Replaced IntervalsSkipList with OverlapDetector #4154
@davidbenjamin Have you actually profiled this change on a Spark/dataproc cluster? |
@droazen Adam did profiling that showed that overlaps detector was strictly better. We also had some suspicion that there was some bug lurking in the skip list implementation because of weird non-deterministic results of something that relied on it. |
@lbergelson Adam did the profiling? That must have been years ago. Someone definitely needs to profile this change on a Spark cluster using the latest master to see if his ancient findings still hold up. |
@lbergelson @droazen I can do it but I would need a lot of hand-holding for both the profiling and the Spark. |
We're happy to help -- I think it's worth doing even if this is a no-brainer substitution, as it's easy to break Spark functionality on a cluster by silly things like introducing a non-serializable class in the wrong place. |
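The serialization concern above can be checked locally before going to a cluster. The following is a minimal sketch, assuming plain Java serialization; a Spark job configured with a different serializer (for example Kryo) may behave differently. `SerializationCheck`, `Helper`, `BrokenIndex`, and `GoodIndex` are hypothetical names for illustration and are not classes from this PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SerializationCheck {
    // Hypothetical helper that does NOT implement Serializable.
    static final class Helper { }

    // Declares itself Serializable but holds a non-serializable field,
    // which is the kind of mistake that only shows up at runtime.
    static final class BrokenIndex implements Serializable {
        private static final long serialVersionUID = 1L;
        final Helper helper = new Helper();
    }

    // Every field is serializable, so writing this object succeeds.
    static final class GoodIndex implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> names = new ArrayList<>();
    }

    // Returns true if the object graph can be written with Java serialization.
    static boolean isJavaSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("GoodIndex serializable: " + isJavaSerializable(new GoodIndex()));
        System.out.println("BrokenIndex serializable: " + isJavaSerializable(new BrokenIndex()));
    }
}
```

A check like this only catches one class of cluster failure; it does not replace running the tool on a real Spark cluster.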
@droazen When we left off you showed me how to use JProfiler (which I now use regularly) and had pointed me to instructions for running GATK on a Spark cluster. I realize I'm going to need another round of hand-holding, because I'm not sure which Spark cluster to test on, and I'm not sure which command to test. And beyond measuring total runtime and looking for any new hotspots in JProfiler I don't know what else to measure. Could I get some more help from someone on the engine team? |
@jamesemery Can you make a decision on what to do with this one after we merge #5292 ? |
Codecov Report
@@ Coverage Diff @@
## master #4154 +/- ##
===============================================
+ Coverage 87.037% 87.043% +0.006%
+ Complexity 31728 31682 -46
===============================================
Files 1943 1938 -5
Lines 146193 146014 -179
Branches 16141 16117 -24
===============================================
- Hits 127242 127095 -147
+ Misses 13064 13035 -29
+ Partials 5887 5884 -3
This is easily resolvable without performance profiling since the only remaining use of the |
@davidbenjamin I've stolen this branch from you. I hope you don't mind. |
@lbergelson No objection. It's in better hands than mine now. |
I approve this PR...
@davidbenjamin Thank you for this one! |
Closes #3541. Closes #3608.
@lbergelson May I assign you by virtue of your github activity on this and related issues?
Note that `OverlapDetector::getOverlaps` returns an unsorted `Set`, so I sorted its output in a few places. I don't know if this was the right solution.
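To illustrate the sorting step mentioned above, here is a minimal sketch of turning an unordered `Set` of overlaps into a deterministically ordered list. It does not use the real `OverlapDetector` or GATK interval classes: `DemoInterval`, `SortedOverlaps`, and the `contigOrder` parameter are hypothetical stand-ins, and the ordering (contig order, then start, then end) is an assumption about what "sorted" should mean here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a genomic interval; not the GATK or htsjdk class.
final class DemoInterval {
    final String contig;
    final int start;
    final int end;

    DemoInterval(String contig, int start, int end) {
        this.contig = contig;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return contig + ":" + start + "-" + end;
    }
}

final class SortedOverlaps {
    // Copies the unordered set into a list ordered by contig position in
    // contigOrder, then by start, then by end. Contigs missing from
    // contigOrder get index -1 and therefore sort first.
    static List<DemoInterval> sorted(Set<DemoInterval> overlaps, List<String> contigOrder) {
        List<DemoInterval> result = new ArrayList<>(overlaps);
        result.sort(Comparator
                .comparingInt((DemoInterval i) -> contigOrder.indexOf(i.contig))
                .thenComparingInt(i -> i.start)
                .thenComparingInt(i -> i.end));
        return result;
    }

    public static void main(String[] args) {
        // A HashSet has no guaranteed iteration order, like the Set described above.
        Set<DemoInterval> overlaps = new HashSet<>(Arrays.asList(
                new DemoInterval("2", 5, 10),
                new DemoInterval("1", 50, 60),
                new DemoInterval("1", 20, 30)));
        System.out.println(sorted(overlaps, Arrays.asList("1", "2")));
    }
}
```

Sorting at the call site keeps downstream output deterministic regardless of the set's iteration order, at the cost of an extra O(n log n) pass per query.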