csoeder changed the title from "Non-text characters appearing in" to "Non-text characters appearing in SEQ field" on Sep 6, 2023
I had an odd error crop up. I aligned my reads as usual:
bwa samse droSim1.fa fragSimulated_BCM10NE.clean.R0.fastq.droSim1.sai fragSimulated_BCM10NE.clean.R0.fastq -r '@RG\tID:foo\tSM:bar' > test.sam
(I have tried this with and without the readgroup flag, with the same outcome)
but when I tried to manipulate the alignment with samtools I got this:
$ samtools view test.sam > /dev/null
[W::sam_read1_sam] Parse error at line 28
samtools view: error reading file "test.sam"
There was nothing obviously wrong with line 28:
$ head -n 28 test.sam | tail -n 2
READ_7 0 chr2h_random 1166746 37 100M * 0 0 GCAAACCTATTTGAGCCTGCTTCAGACACGACGGTGAGGTATGCACTGTTTCGATGTAAAGAGAGTCGGCGCTCGTCTTGCTCATTTTGCCGCTGAGCGC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo
I experimented a bit; deleting the offending line just finds another one at line 353, and so on. Some are unmapped; some are mapped. Oddly, although the file appeared to be an uncompressed SAM, it sometimes behaved as though it were a binary, such as not grepping properly:
$ grep READ test.sam
Binary file test.sam matches
$ file test.sam
test.sam: data
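One way to list the affected lines without grep's binary-file behavior getting in the way is sketched below. It assumes the stray bytes are NUL (0x00), which is what less displays as ^@, and that the file contains no SOH (0x01) bytes of its own. The function name `nul_lines` is made up for illustration.

```shell
# Print the line number of every line that contains a NUL byte.
# NUL is mapped to SOH (0x01) first so that grep treats the stream as text.
nul_lines() {
    tr '\000' '\001' < "$1" |
        LC_ALL=C grep -n "$(printf '\001')" |
        cut -d: -f1
}
```

Running `nul_lines test.sam` should then print one line number per affected record, which can be compared with the line numbers in the samtools parse errors.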
Then I noticed that the problematic lines have a length mismatch between the SEQ and QUAL fields, which seemed like it could be the issue, but I have never encountered this before.
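The length comparison can be scripted roughly as follows. This is only a sketch: it assumes a tab-delimited SAM with SEQ in column 10 and QUAL in column 11, and it deletes NUL bytes first so that the lengths match what is visible when the line is printed. The function name `seq_qual_mismatches` is hypothetical.

```shell
# Report records whose SEQ and QUAL lengths differ once NUL bytes are dropped.
# Output format: <line number>: <read name> SEQ=<length> QUAL=<length>
seq_qual_mismatches() {
    tr -d '\000' < "$1" |
        awk -F'\t' '
            /^@/ { next }   # skip header lines
            NF >= 11 && $10 != "*" && $11 != "*" && length($10) != length($11) {
                print NR ": " $1 " SEQ=" length($10) " QUAL=" length($11)
            }'
}
```

Calling `seq_qual_mismatches test.sam` prints nothing for a file in which every record has matching lengths.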
Finally, while I was examining the file in less, I noticed this:
READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNN^@NNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo
There seems to be a mystery character, "^@", which has found its way into the SEQ field instead of a proper character. I don't know how samtools interprets it, but when sent to standard output it just disappears, giving a string with one fewer character than it's supposed to have. Within less, it appears with the reverse coloration that you see when you open a binary file, which might explain some previous observations.
The only thing I'm doing that's slightly unusual is that I'm aligning pseudoreads synthetically generated by fragmenting a reference genome, but I don't think that's responsible (I've recently used this pipeline without issue).
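To narrow down whether the stray bytes come from the simulated reads or appear only after alignment, the input FASTQ and the output SAM can be checked side by side. The sketch below assumes the stray bytes are NUL (0x00); `nul_report` is a made-up helper name, not part of bwa or samtools.

```shell
# For each file given, print its name and the number of NUL bytes it contains,
# separated by a tab.
nul_report() {
    for f in "$@"; do
        n=$(LC_ALL=C tr -cd '\000' < "$f" | wc -c | tr -d ' ')
        printf '%s\t%s\n' "$f" "$n"
    done
}
```

For example, `nul_report fragSimulated_BCM10NE.clean.R0.fastq test.sam` would show a nonzero count for the FASTQ if the NUL bytes were already present before bwa ran.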
Any tips on how to make this not happen?
Thanks,
Charlie