Htsfree #4

LTLA · 2019-03-31T03:10:05Z

Fixes #3 (Refactor to use GenomicAlignments), from discussions with @jmacdon.
Also fixes #2 (Fix discard for paired end reads).
Note that getPESizes is not functional right now.

Modified behaviour of discard for paired-end data.

LTLA · 2019-03-31T08:36:57Z

Well, so much for the BamFile idea. To recap, the hope was that we could open the BamFile at the start of the calling function, and pass the resulting object to the BAM file-reading functions. This would avoid the overhead of setting up the BAM file handle at every iteration of a file read.

However, this doesn't work when you ask for paired reads by chromosome, because readGAlignmentPairs will also search for mates on other chromosomes. This forces the file pointer forwards, so the other chromosomes are skipped entirely when the calling function loops to them.

We can get around it by opening and closing the BamFile at every per-chromosome iteration, but I wonder if this would defeat the performance benefit of using a BamFile in the first place... @mtmorgan?
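A minimal sketch of that open-and-close workaround, assuming the single-end example file rep1.bam shipped with csaw (the same file used in the reproduction further down this thread). It uses scanBam rather than readGAlignmentPairs to keep the example small, and the loop variable names are made up for illustration:

```r
library(Rsamtools)  # also attaches GenomicRanges and IRanges

bam.path <- system.file("exdata", "rep1.bam", package="csaw")
targets <- scanBamHeader(bam.path)[[1]]$targets

positions <- list()
for (chr in names(targets)) {
    # Fresh handle per chromosome, so the file pointer always starts clean.
    handle <- BamFile(bam.path)
    open(handle)
    param <- ScanBamParam(what="pos",
        which=GRanges(chr, IRanges(1, targets[[chr]])))
    positions[[chr]] <- scanBam(handle, param=param)[[1]]$pos
    close(handle)
}

lengths(positions)
```

The cost is one index parse and one seek per chromosome, which is exactly the overhead the BamFile idea was meant to avoid.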

jmacdon · 2019-04-01T18:31:52Z

Does setting isProperPair = FALSE in the bamFlags not preclude that problem? Put a different way, are there aligners out there these days that would set the 0x2 bit if the two reads align to different chromosomes?


LTLA · 2019-04-02T02:04:38Z

I had thought about that, but was unwilling to rely on the aligner's definition of what a proper pair was. Notwithstanding inter-chromosomal pairs, different aligners might use different metrics for defining a proper pair - the maximum allowable fragment length is one such parameter that comes to mind.

If we had to do some filtering, the INS field would be much more standard for getting rid of inter-chromosomal pairs (as well as large fragments), as mentioned in Bioconductor/GenomicAlignments#4.

jmacdon · 2019-04-02T14:02:36Z

What is the INS field? I don't see that in the SAM spec.


LTLA · 2019-04-02T14:13:29Z

Oops! Well spotted, I got my names mixed up. I was referring to TLEN, which is known to (R)samtools as ISIZE (not entirely sure why those two have different names, but there we go).
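As a rough illustration of what TLEN-based filtering could look like, here is a base-R sketch. The helper name keepIntraChr and the example values are hypothetical; the three input vectors are assumed to correspond to the rname, mrnm and isize fields that scanBam can return via ScanBamParam(what=...):

```r
# Hypothetical helper: flag alignments whose mate lies on the same reference
# sequence and whose absolute TLEN (ISIZE in Rsamtools) is at most max.frag.
keepIntraChr <- function(rname, mrnm, isize, max.frag = 400) {
    rname <- as.character(rname)
    mrnm <- as.character(mrnm)
    same.chr <- !is.na(rname) & !is.na(mrnm) & rname == mrnm
    small <- !is.na(isize) & abs(isize) <= max.frag
    same.chr & small
}

# Made-up values for four alignments.
rname <- c("chr1", "chr1", "chr1", NA)
mrnm  <- c("chr1", "chr7", "chr1", NA)
isize <- c(250L, 0L, -5000L, NA)
keep <- keepIntraChr(rname, mrnm, isize)
```

Note that this is still filtering in memory after the records have been read, so it does not by itself avoid the I/O cost of inter-chromosomal mates.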

jmacdon · 2019-04-02T14:57:40Z

The GAlignmentPairs object will have NA values for any cross-chromosomal pairs:

z <- readGAlignmentPairs(bf, param = param)
z
GAlignmentPairs object with 9247 pairs, strandMode=1, and 0 metadata columns:
         seqnames strand :              ranges --          ranges
            <Rle>  <Rle> :           <IRanges> --       <IRanges>
     [1]     chr1      - :     3000136-3000181 -- 3000136-3000182
     [2]     chr1      + :     3000275-3000374 -- 3000460-3000560
     [3]     chr1      + :     3000387-3000487 -- 3000457-3000556
     [4]     chr1      + :     3000399-3000499 -- 3000490-3000590
     [5]     chr1      + :     3000535-3000635 -- 3000784-3000884
     ...      ...    ... ...               ... ...             ...
  [9243]     <NA>      - : 130880221-130880321 -- 3771943-3771982
  [9244]     <NA>      - :     5571716-5571782 -- 3087890-3087990
  [9245]     <NA>      - :   14742478-14742542 -- 3802450-3802550
  [9246]     <NA>      - :   41215388-41215454 -- 3507909-3507944
  [9247]     <NA>      * :   20659260-20659360 -- 3949244-3949310
  -------
  seqinfo: 66 sequences from an unspecified genome

Those last five are cross-chromosomal pairs

grglist(z[9246:9247,])
GRangesList object of length 2:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]     chr7 41215388-41215454      -
  [2]     chr1   3507909-3507944      -

[[2]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames            ranges strand
  [1]     chr8 20659260-20659360      +
  [2]     chr1   3949244-3949310      -

So you could hypothetically just read in the whole chromosome, and dump out the cross-chromosomal reads, then filter by fragment size

z <- z[!is.na(seqnames(z)),]
z <- granges(z)
z <- z[width(z) < 400,]
z
GRanges object with 8922 ranges and 0 metadata columns:
         seqnames          ranges strand
            <Rle>       <IRanges>  <Rle>
     [1]     chr1 3000136-3000182      -
     [2]     chr1 3000275-3000560      +
     [3]     chr1 3000387-3000556      +
     [4]     chr1 3000399-3000590      +
     [5]     chr1 3000535-3000884      +
     ...      ...             ...    ...
  [8918]     chr1 3999320-3999403      -
  [8919]     chr1 3999323-3999545      +
  [8920]     chr1 3999547-3999693      -
  [8921]     chr1 3999620-3999962      -
  [8922]     chr1 3999935-4000021      +
  -------

And that's all pretty fast, I think.


jmacdon · 2019-04-02T15:56:51Z

Or there is the super-secret argument on.discordant.seqnames, known only to Herve and some Mossad agents:

bf <- BamFile("tmp_sorted.bam", asMates = TRUE)
param <- ScanBamParam(which = GRanges("chr1:3000000-4000000"))
z <- readGAlignmentPairs(bf, param = param)
zz <- granges(z, on.discordant.seqnames = "drop")
zz <- zz[width(zz) < 400,]
zz
GRanges object with 8922 ranges and 0 metadata columns:
         seqnames          ranges strand
            <Rle>       <IRanges>  <Rle>
     [1]     chr1 3000136-3000182      -
     [2]     chr1 3000275-3000560      +
     [3]     chr1 3000387-3000556      +
     [4]     chr1 3000399-3000590      +
     [5]     chr1 3000535-3000884      +
     ...      ...             ...    ...
  [8918]     chr1 3999320-3999403      -
  [8919]     chr1 3999323-3999545      +
  [8920]     chr1 3999547-3999693      -
  [8921]     chr1 3999620-3999962      -
  [8922]     chr1 3999935-4000021      +
  -------


LTLA · 2019-04-02T22:30:03Z

To be clear, my concern isn't about filtering out inter-chromosomal read pairs in memory; it's about avoiding them being read into memory at all. The current state of this PR does memory-level filtering, but it should theoretically be possible to get better performance by skipping them during I/O.

Of course, if the current PR status is sufficiently fast for your use cases, I'll merge. I can't really check until I get my new laptop - 2GB of RAM is not enough for genomics these days.

LTLA · 2019-04-05T04:07:58Z

Well... this is disappointing.

library(Rsamtools)
bf <- system.file("exdata", "rep1.bam", package="csaw")
H <- scanBamHeader(bf)[[1]]$targets
H
## chrA chrB chrC 
## 1298  870 1345 

handle <- BamFile(bf)
open(handle)

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrA", IRanges(1, H[1]))))
## $`chrA:1-1298`
## $`chrA:1-1298`$pos
##   [1]    3    4    6    8   10   12   12   12   13   13   17   20   21   23   26
## ... etc.

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrB", IRanges(1, H[2]))))
## $`chrB:1-870`
## $`chrB:1-870`$pos
## integer(0)

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrC", IRanges(1, H[3]))))
## $`chrC:1-1345`
## $`chrC:1-1345`$pos
## integer(0)

close(handle)

As you can see, trying to retrieve reads on an open BAM file handle that's already been searched by position... doesn't work, even if the ensuing calls refer to reads that should occur later in the file.

mtmorgan · 2019-04-05T12:19:37Z

Construct the GRanges up-front (https://support.bioconductor.org/p/119631/#119632)? I guess the costs of opening a file are parsing the index and seeking to the position; opening the BamFile once and using GRanges saves the cost of index parsing. One could also probably implement a seek,BamFile method, and I'm happy to do that if you open an issue on the Rsamtools repository. Or maybe it's a bug...

jmacdon · 2019-04-05T14:08:25Z

Is the goal to be able to convert to using GenomicAlignments with as little disruption to your existing code base as possible? Or is the goal to limit ongoing memory usage while reading in data? If the latter, I think iterating through the bam file using yieldSize, and at each iteration converting to counts per window would limit the total memory required, but at the expense of requiring other changes to your code base.
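A minimal sketch of that yieldSize pattern, assuming the rep1.bam example file from csaw. The chunk size of 1000 is an arbitrary choice, and per-chromosome read counts stand in here for the per-window counts that a real implementation would compute:

```r
library(Rsamtools)
library(GenomicAlignments)

bam.path <- system.file("exdata", "rep1.bam", package="csaw")
handle <- BamFile(bam.path, yieldSize=1000)  # records per chunk (arbitrary)
open(handle)

# Running tally, updated one chunk at a time so that only one chunk of
# alignments is held in memory at any point.
counts <- NULL
repeat {
    chunk <- readGAlignments(handle)
    if (length(chunk) == 0L) break
    cur <- table(seqnames(chunk))
    counts <- if (is.null(counts)) cur else counts + cur
}
close(handle)

counts
```

Because no `which` is supplied, each call simply continues from where the previous one stopped, which is why the handle has to stay open across iterations.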


jmacdon · 2019-04-05T14:20:11Z

Never mind, I see that it's the former.


jmacdon · 2019-04-05T14:28:17Z

Also, @martinmorgan, using a GRanges of different lengths (like by-chromosome) seems to suffer from the same problem.


mtmorgan · 2019-04-05T14:30:25Z

@jmacdon can you describe what you mean at Bioconductor/Rsamtools#6 ?

LTLA · 2019-04-05T15:24:54Z

Thanks @mtmorgan, the below seems to work for me:

library(Rsamtools)
bf <- system.file("exdata", "rep1.bam", package="csaw")
H <- scanBamHeader(bf)[[1]]$targets
H
## chrA chrB chrC
## 1298  870 1345

handle <- BamFile(bf, yieldSize=1)
all.ref <- GRanges(names(H), IRanges(1, H))
param <- ScanBamParam(what="pos", which=all.ref)

open(handle)

scanBam(handle, param=param)

scanBam(handle, param=param)

scanBam(handle, param=param)

close(handle)

@jmacdon; yes, the idea would be to just swap out the existing read input functions for GenomicAlignments, without having to rewrite a whole lot of the surrounding context, while still preserving, as much as possible, the current performance characteristics. We're almost there; the performance degradation is acceptable for standard applications, it's just this scaffold case that sucks.

jmacdon · 2019-04-05T17:28:16Z

Yes, that works for simple queries, but not for readGAlignmentPairs, which I imagine is the most common use case these days:

b <- BamFile("aligned_nodups_20190130_acomys/1-D-1Aligned.sortedByCoord.out.bam", yieldSize = 1L, asMates = TRUE)
open(b)
repeat{
    aln <- readGAlignmentPairs(b, param = param)
    if(length(aln) == 0L)
        break
    print(head(table(seqnames(aln)), 2))
}

 LAS1  LAS2
64280     0
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
  26 alignments with ambiguous pairing were dumped.
  Use 'getDumpedAlignments()' to retrieve them from the dump environment.


LTLA · 2019-04-06T03:29:23Z

@jmacdon One more roll of the dice. I've just switched windowCounts to use the scheme suggested by @mtmorgan; can you see how fast it runs on your scaffolds in single-end mode?

jmacdon · 2019-04-09T13:24:13Z

Legit...

pe.param.acomys <- readParam(max.frag = 400, minq = 200, BPPARAM = MulticoreParam(10))
system.time(windowCounts(asamps$files, ext = 250, width = 150, spacing = 75, param = pe.param.acomys))
    user   system  elapsed
11075.76 19712.78 13859.93

I ran that under the assumption that I didn't need to use bpstart on the BPPARAM object, but maybe I did need to?


jmacdon · 2019-04-09T13:24:20Z

Ugh. Never mind. That was the release version. Re-running now.


LTLA · 2019-04-10T02:59:12Z

... is it still running?

jmacdon · 2019-04-10T03:56:08Z

Yes. Without using bpstart(), which I assume is superfluous at this point, as it seems you use that internally.


jmacdon · 2019-04-10T13:35:07Z

Wow. Still running this morning...


jmacdon · 2019-04-10T14:32:03Z

Possibly helpful:

system.time(windowCounts(asamps$files, ext = 250, width = 150, spacing = 75, param = pe.param.acomys, BPPARAM = MulticoreParam(10)))
  C-c C-c
Warning messages:
1: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
2: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
3: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
4: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
5: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
6: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
7: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
8: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
9: In totals[bf] + bp.out[[bf]]$totals : NAs produced by integer overflow
Timing stopped at: 527.6 64.32 6.355e+04


jmacdon · 2019-04-10T14:32:13Z

And for completeness

sessionInfo()
R Under development (unstable) (2019-03-19 r76252)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /data/oldR/R-devel/lib64/R/lib/libRblas.so
LAPACK: /data/oldR/R-devel/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] csaw_1.17.8                 SummarizedExperiment_1.13.0
 [3] DelayedArray_0.9.9          BiocParallel_1.17.18
 [5] matrixStats_0.54.0          Biobase_2.43.1
 [7] GenomicRanges_1.35.1        GenomeInfoDb_1.19.2
 [9] IRanges_2.17.4              S4Vectors_0.21.21
[11] BiocGenerics_0.29.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1               compiler_3.6.0         XVector_0.23.2
 [4] prettyunits_1.0.2        GenomicFeatures_1.35.9 bitops_1.0-6
 [7] tools_3.6.0              zlibbioc_1.29.0        progress_1.2.0
[10] biomaRt_2.39.2           digest_0.6.18          bit_1.1-14
[13] RSQLite_2.1.1            memoise_1.1.0          lattice_0.20-38
[16] pkgconfig_2.0.2          rlang_0.3.3            Matrix_1.2-17
[19] DBI_1.0.0                GenomeInfoDbData_1.2.0 rtracklayer_1.43.3
[22] httr_1.4.0               stringr_1.4.0          hms_0.4.2
[25] Biostrings_2.51.5        locfit_1.5-9.1         bit64_0.9-7
[28] grid_3.6.0               R6_2.4.0               AnnotationDbi_1.45.1
[31] XML_3.98-1.19            limma_3.39.14          edgeR_3.25.3
[34] magrittr_1.5             blob_1.1.1             Rsamtools_1.99.4
[37] GenomicAlignments_1.19.1 assertthat_0.2.1       stringi_1.4.3
[40] RCurl_1.95-4.12          crayon_1.3.4


LTLA · 2019-04-10T23:08:14Z

Oh geez. Let me double-check my code, maybe I did something wrong.

LTLA · 2019-04-11T06:45:40Z

Well, I don't think I stuffed anything up. My small examples don't show any difference between this branch and master. The warning messages imply that windowCounts is incorrectly loading in many, many more reads, but I don't know why. Guess we'll just have to sit tight and wait for Bioconductor/Rsamtools#6.
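For context on the "NAs produced by integer overflow" warnings, here is a small base-R illustration of how adding integer totals behaves once the sum passes .Machine$integer.max. The numbers are made up and are not taken from the actual run; accumulating in double precision is shown only as one possible way to sidestep the warning, not as what csaw does:

```r
# Hypothetical per-file read totals stored as integers (values are made up).
chunk.totals <- c(2000000000L, 500000000L)

# Integer addition past .Machine$integer.max yields NA with a warning.
int.total <- suppressWarnings(chunk.totals[1] + chunk.totals[2])

# Accumulating in double precision avoids the overflow.
dbl.total <- sum(as.numeric(chunk.totals))
```

Either way, totals this large would still point to far more reads being counted than expected, which is the underlying problem.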

LTLA added 9 commits March 28, 2019 22:04

Switch back to using overlapsAny rather than intersector.

44b3abc

Modified behaviour of discard for paired-end data.

Began updating some tests for new PE behaviour.

e7ffd48

Read extraction with GenomicAlignments instead of HTSLib.

1e07316

Eliminated traces of HTSlib.

cded084

Refactored extractPE to report diagnostics for getPESizes.

39959b4

Switched to double for max.frag in readParam.

f39a644

Updated tests for slimmed-down getPESizes functionality.

6e08a81

Rebuilt NAMESPACE.

401c856

Typo fix, test fix.

89f29d9

LTLA mentioned this pull request Mar 31, 2019

Performance of readGAlignmentPairs with inter-chromosomal pairs Bioconductor/GenomicAlignments#4

Open

LTLA added 3 commits March 31, 2019 14:26

Bugfix for correct inter.chr counts in getPESizes.

0097e30

updated documentation for new paired-end handling.

baadbdf

Minor bugfixes to pass check.

8e82eec

Experiment with passing BamFiles to internals in windowCounts.

dbe55fc

mtmorgan mentioned this pull request Apr 5, 2019

handle multiple calls to scanBam() on an open BamFile Bioconductor/Rsamtools#6

Open

Open BAM handles for single-end speed testing.

4fda1c4

LTLA force-pushed the master branch from e7ffd48 to 250f29b Compare April 26, 2019 00:39

LTLA mentioned this pull request May 22, 2019

"xxx.bam" indexes as "xxx.bai" or "xxx.bam.bai" #5

Open
