Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible off-by-1 error at chunksize boundary #21

Closed
schelhorn opened this issue May 20, 2015 · 3 comments
Closed

Possible off-by-1 error at chunksize boundary #21

schelhorn opened this issue May 20, 2015 · 3 comments

Comments

@schelhorn
Copy link

When working on grabix indexed FASTQ files in bcbio, I get an off-by-one error since the latest commit six days ago (version 0.1.5) when crossing the chunksize (which is 10000), possibly due to an earlier off-by-1 fix in 8d81fed that may not have had the intended result (the latter is pure speculation on my side). As a result, I get malformed FASTQ files when grabbing ranges of rows that are multiplies of 4 (the length of a canonical FASTQ record) whenever I perform a grab that starts at more than 10000 lines.

For instance, this is a asking for a grab that starts with the first entry of a FASTQ record (the FASTQ ID), since 9997 % 4 == 1 and it works fine since we start below the chunksize:

$ grabix grab SAMPLE.fq.gz 9997 10012
@2500/1
CATAGCCATGGCCTTTGACAGATACATGGCCATATGTAAACCTCTCCACTACCTGACCATCATGAGCCCAAGAATGTGTCTATACTTTTT
+
GGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGFGGGGGGGGFGGGEGGEGGGGGGEGGGEGGFEGGGGF
@2501/1
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1
AGTCTTATTGTTGGTTTGACTGTGAGTATATGGAACTCTGGCAAAATGAAATGGAGCAAAACACACAAAGAAGACAGCCACGACAACAAA
+
GGGGGGGGGGGGDAGFGBGGGFFGGGEDFEGGGGEGGGGGGGGGGGGGEGGGGDEGGGFDGGGFGGGFDEDGGGGGGGGGGGDGGFGDD=

However, when I start the grab on the exact same file and start 4 rows further along at start position 10001 (10001 % 4 == 1) which should give me the FASTQ ID of the next record as first line, I get:

$ grabix grab SAMPLE.fq.gz 10001 1008
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1

Note that the latter entry does not start with line 10001 but with line 10002, and also ends with line 10009 instead of line 10008. The input FASTQ file can be parsed fine using a hand-made parser that checks for validity. Also, the validity of the input file can be checked here in the first output (see above) that also covers the chunksize boundary and includes all relevant records that are erroneously chunked in the second entry.

@schelhorn
Copy link
Author

Either way this pans out (valid issue with grabix or some weird fluke on my side), I suggest it may make sense to write unit tests for multi-line-record input files that check for validity of the grabbed records. FASTQ files seem to be a reasonably choice for such a test file (VCF not so much, since off-by-1s may be more difficult to find if also the end position shifts by one).

@arq5x
Copy link
Owner

arq5x commented May 20, 2015

Thanks for reporting this. I suspect the fix we added the other day for small files had an unexpected consequence. We will try to fix this and add functional tests asap.

@brentp
Copy link
Collaborator

brentp commented May 20, 2015

I've found the source of this. I'll push shortly with tests.

@arq5x arq5x closed this as completed in ba792bc May 20, 2015
chapmanb added a commit to chapmanb/homebrew-science that referenced this issue May 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants