Possible off-by-1 error at chunksize boundary #21

schelhorn · 2015-05-20T11:55:01Z

When working on grabix indexed FASTQ files in bcbio, I get an off-by-one error since the latest commit six days ago (version 0.1.5) when crossing the chunksize (which is 10000), possibly due to an earlier off-by-1 fix in 8d81fed that may not have had the intended result (the latter is pure speculation on my side). As a result, I get malformed FASTQ files when grabbing ranges of rows that are multiplies of 4 (the length of a canonical FASTQ record) whenever I perform a grab that starts at more than 10000 lines.

For instance, this is a asking for a grab that starts with the first entry of a FASTQ record (the FASTQ ID), since 9997 % 4 == 1 and it works fine since we start below the chunksize:

$ grabix grab SAMPLE.fq.gz 9997 10012
@2500/1
CATAGCCATGGCCTTTGACAGATACATGGCCATATGTAAACCTCTCCACTACCTGACCATCATGAGCCCAAGAATGTGTCTATACTTTTT
+
GGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGFGGGGGGGGFGGGEGGEGGGGGGEGGGEGGFEGGGGF
@2501/1
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1
AGTCTTATTGTTGGTTTGACTGTGAGTATATGGAACTCTGGCAAAATGAAATGGAGCAAAACACACAAAGAAGACAGCCACGACAACAAA
+
GGGGGGGGGGGGDAGFGBGGGFFGGGEDFEGGGGEGGGGGGGGGGGGGEGGGGDEGGGFDGGGFGGGFDEDGGGGGGGGGGGDGGFGDD=

However, when I start the grab on the exact same file and start 4 rows further along at start position 10001 (10001 % 4 == 1) which should give me the FASTQ ID of the next record as first line, I get:

$ grabix grab SAMPLE.fq.gz 10001 1008
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1

Note that the latter entry does not start with line 10001 but with line 10002, and also ends with line 10009 instead of line 10008. The input FASTQ file can be parsed fine using a hand-made parser that checks for validity. Also, the validity of the input file can be checked here in the first output (see above) that also covers the chunksize boundary and includes all relevant records that are erroneously chunked in the second entry.

The text was updated successfully, but these errors were encountered:

schelhorn · 2015-05-20T12:50:58Z

Either way this pans out (valid issue with grabix or some weird fluke on my side), I suggest it may make sense to write unit tests for multi-line-record input files that check for validity of the grabbed records. FASTQ files seem to be a reasonably choice for such a test file (VCF not so much, since off-by-1s may be more difficult to find if also the end position shifts by one).

arq5x · 2015-05-20T12:57:22Z

Thanks for reporting this. I suspect the fix we added the other day for small files had an unexpected consequence. We will try to fix this and add functional tests asap.

brentp · 2015-05-20T14:00:00Z

I've found the source of this. I'll push shortly with tests.

Fixes off by one error (arq5x/grabix#21 bcbio/bcbio-nextgen#854)

schelhorn mentioned this issue May 20, 2015

Possible off-by-1 error in grabix bcbio/bcbio-nextgen#854

Closed

arq5x closed this as completed in ba792bc May 20, 2015

chapmanb added a commit to chapmanb/homebrew-science that referenced this issue May 20, 2015

grabix: 0.1.6 update

54f28b9

Fixes off by one error (arq5x/grabix#21 bcbio/bcbio-nextgen#854)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible off-by-1 error at chunksize boundary #21

Possible off-by-1 error at chunksize boundary #21

schelhorn commented May 20, 2015

schelhorn commented May 20, 2015

arq5x commented May 20, 2015

brentp commented May 20, 2015

Possible off-by-1 error at chunksize boundary #21

Possible off-by-1 error at chunksize boundary #21

Comments

schelhorn commented May 20, 2015

schelhorn commented May 20, 2015

arq5x commented May 20, 2015

brentp commented May 20, 2015