Non-text characters appearing in SEQ field #402

csoeder · 2023-09-06T21:35:23Z

I had an odd error crop up. I aligned my reads as usual:
bwa samse droSim1.fa fragSimulated_BCM10NE.clean.R0.fastq.droSim1.sai fragSimulated_BCM10NE.clean.R0.fastq -r '@RG\tID:foo\tSM:bar' > test.sam
(I have tried this with and without the readgroup flag, with the same outcome)

But when I tried to manipulate the alignment with samtools, I got this:
$ samtools view test.sam > /dev/null
[W::sam_read1_sam] Parse error at line 28
samtools view: error reading file "test.sam"

There was nothing obviously wrong with line 28:

$ head -n 28 test.sam | tail -n 2
READ_7 0 chr2h_random 1166746 37 100M * 0 0 GCAAACCTATTTGAGCCTGCTTCAGACACGACGGTGAGGTATGCACTGTTTCGATGTAAAGAGAGTCGGCGCTCGTCTTGCTCATTTTGCCGCTGAGCGC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

I experimented a bit: deleting the offending line just moves the parse error to another line (353), and so on. Some of the affected reads are unmapped; some are mapped. Oddly, although the file appeared to be an uncompressed SAM, it sometimes behaved as though it were a binary file. For example, grep and file both treat it as binary data:
grep READ test.sam
Binary file test.sam matches
file test.sam
test.sam: data

Then I noticed that the problematic lines have a length mismatch between the SEQ and QUAL fields. That seemed like it could be the issue, but I have never encountered it before.
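The SEQ/QUAL comparison described above can be automated rather than checked by eye. The following is a minimal sketch, not part of bwa or samtools: the function name `seq_qual_mismatch` is hypothetical, and it assumes a tab-delimited SAM file. NUL bytes are stripped first, since awk implementations differ in how they treat them, so the reported lengths reflect the printable characters a terminal would show.

```shell
# Hypothetical helper: list records whose SEQ (field 10) and QUAL (field 11)
# differ in printable length. Output columns: line number, read name,
# SEQ length, QUAL length.
seq_qual_mismatch() {
    # Drop NUL bytes first; awk implementations handle them inconsistently.
    tr -d '\000' < "$1" | awk -F'\t' '
        /^@/ { next }   # skip SAM header lines
        $10 != "*" && $11 != "*" && length($10) != length($11) {
            print NR, $1, length($10), length($11)
        }
    '
}
```

Running `seq_qual_mismatch test.sam` should print one line per affected record, which makes it easier to see whether the problem reads share anything in common.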

Finally, while I was examining the file in less, I noticed this:

READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNN^@NNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

There seems to be a mystery character, "^@", which has found its way into the SEQ field in place of a proper base. I don't know how samtools interprets it, but when the line is sent to standard output it simply disappears, giving a string one character shorter than it should be. Within less, it appears in the reverse-video highlighting that you see when you open a binary file, which might explain the earlier observations.
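For reference, `^@` is the caret notation that less uses for a NUL byte (0x00), which would also account for grep and file classifying the SAM as binary. A rough way to narrow down where the byte comes from is to count NUL bytes in both the input FASTQ and the output SAM. The sketch below uses only `wc` and `tr`; the function name `count_nul` is hypothetical, and whether the NULs are already present in the FASTQ is an assumption to test, not a conclusion.

```shell
# Hypothetical helper: print the number of NUL (0x00) bytes in a file,
# computed as total bytes minus bytes remaining after NULs are deleted.
count_nul() {
    echo $(( $(wc -c < "$1") - $(tr -d '\000' < "$1" | wc -c) ))
}
```

Comparing `count_nul fragSimulated_BCM10NE.clean.R0.fastq` with `count_nul test.sam` would show whether the bytes are already in the simulated reads (pointing at the read generator or the storage layer) or only appear after alignment.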

The only thing I'm doing that's slightly unusual is that I'm aligning pseudoreads synthetically generated by fragmenting a reference genome, but I don't think that's responsible (I've recently used this pipeline without issue).

Any tips on how to make this not happen?

Thanks,
Charlie


