Possible off-by-1 error at chunksize boundary #21
Comments
Either way this pans out (valid issue with grabix or some weird fluke on my side), I suggest it may make sense to write unit tests for multi-line-record input files that check the validity of the grabbed records. FASTQ files seem to be a reasonable choice for such a test file (VCF not so much, since off-by-1s may be more difficult to spot if the end position also shifts by one).
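A test of this kind could be sketched roughly as below. This is a minimal sketch, not grabix's actual test code; the helper name `looks_like_valid_fastq` and the sample records are my own illustration of the check being suggested:

```python
def looks_like_valid_fastq(lines):
    """Return True if lines form complete 4-line FASTQ records."""
    if len(lines) == 0 or len(lines) % 4 != 0:
        return False
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):  # record ID line
            return False
        if not plus.startswith("+"):    # separator line
            return False
        if len(seq) != len(qual):       # quality string must match sequence
            return False
    return True

# A grab shifted by one line no longer starts on a record boundary,
# so the check fails:
good = ["@read1", "ACGT", "+", "IIII"]
shifted = good[1:] + ["@read2"]
```

Feeding the output of a grab across the chunksize boundary into such a check would catch the shift reported in this issue.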
Thanks for reporting this. I suspect the fix we added the other day for small files had an unexpected consequence. We will try to fix this and add functional tests asap.
I've found the source of this. I'll push shortly with tests.
Fixes off by one error (arq5x/grabix#21 bcbio/bcbio-nextgen#854)
When working on grabix-indexed FASTQ files in bcbio, I get an off-by-one error since the latest commit six days ago (version 0.1.5) when crossing the chunksize (which is 10000), possibly due to an earlier off-by-1 fix in 8d81fed that may not have had the intended result (the latter is pure speculation on my side). As a result, I get malformed FASTQ files when grabbing ranges of rows that are multiples of 4 (the length of a canonical FASTQ record) whenever I perform a grab that starts at more than 10000 lines.

For instance, this asks for a grab that starts with the first entry of a FASTQ record (the FASTQ ID), since 9997 % 4 == 1, and it works fine since we start below the chunksize:

However, when I start the grab on the exact same file, 4 rows further along at start position 10001 (10001 % 4 == 1), which should give me the FASTQ ID of the next record as the first line, I get:

Note that the latter output does not start with line 10001 but with line 10002, and also ends with line 10009 instead of line 10008. The input FASTQ file can be parsed fine using a hand-made parser that checks for validity. Also, the validity of the input file can be checked in the first output above, which also covers the chunksize boundary and includes all relevant records that are erroneously chunked in the second output.