Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange results #15

Open
andrewjmc opened this issue Aug 17, 2021 · 1 comment
Open

Strange results #15

andrewjmc opened this issue Aug 17, 2021 · 1 comment

Comments

@andrewjmc
Copy link

Hello,

Thanks for this amazing tool. I am using fastv in perhaps an unusual way. I'm looking to detect the presence or absence of homologous gene clusters in metagenomic data. I started with ~17 million ORFs from our contigs, and clustered (mmseqs2, 95% identity) them into almost 4 million clusters. I want to detecte the clusters by detecting one or more unique kmers from their representative sequences.

I used unique_kmer (initially) to identify unique 24mers, but this took a long time, generated millions of files and unique 24-mers could not be found for a majority of sequences. I fell back on jellyfish, 32mers, and a convoluted pipeline of aligning the kmers against the cluster representatives and then filtering the sorted SAM file to include only non-overlapping kmers. This way the vast majority of cluster representatives had one or more unique kmers (mean of 3 and up to hundreds).

I applied fastv with minimal filtering and lowest thresholds (-A -G -Q -L -p 0.001 -d 0.001), but only ~100k of ~3.4 million cluster representatives are ever identified across all >200 samples.

I tested further by extracting only unique kmers from one sample and testing them against the sample reads: no hits!

Yet, when I search with seqkit, I find that the sequence file does indeed contain this kmer three times: "ATGAAATTCCATGGAATGGAATGGAATGGAAA"

e.g.
@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGA*ATGAAATTCCATGGAATGGAATGGAATGGAAA*AGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF

Can you advise why fastv seems not to be detecting it?

Thanks!

@andrewjmc
Copy link
Author

Just to add as a test case, I have provided a fastq sequence file with three reads

@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF
@K00150:414:HK5NLBBXY:1:1121:16457:6976/1
AATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJJFJJJJJAFJJJFFF7<AJJJJFJFJJFJJJJJJJJJJJJJFJJFJJJJJFFFFAJFJJJJJJFJFFJJJF7---77AF<JFJJ-7-
@K00150:414:HK5NLBBXY:2:2212:24393:37396/1
AATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCCACGGAATGGAATGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJ

And a kmer database of:

>test
ATGAAATTCCATGGAATGGAATGGAATGGAAA

or

>test
ATGAAATTCCATGGAATGGAA

Neither the 21-mer or 32-mer was found in the sequences despite their presence.

21-mer example:

{
        "kmer_collection_scan_result": {

        },
        "summary": {
                "before_filtering": {
                        "total_reads":3,
                        "total_bases":450,
                        "q20_bases":443,
                        "q30_bases":432,
                        "q20_rate":0.984444,
                        "q30_rate":0.96,
                        "read1_mean_length":150,
                        "gc_content":0.373333
                },
                "after_filtering": {
                        "total_reads":3,
                        "total_bases":450,
                        "q20_bases":443,
                        "q30_bases":432,
                        "q20_rate":0.984444,
                        "q30_rate":0.96,
                        "read1_mean_length":150,
                        "gc_content":0.373333
                }
        },
        "filtering_result": {
                "passed_filter_reads": 3,
                "low_quality_reads": 0,
                "too_many_N_reads": 0,
                "too_short_reads": 0,
                "too_long_reads": 0
        },
        "duplication": {
                "rate": 0,
                "histogram": [2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                "mean_gc": [0.264706,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
        },
        "read1_before_filtering": {
                "total_reads": 3,
                "total_bases": 450,
                "q20_bases": 443,
                "q30_bases": 432,
                "total_cycles": 150,
                "quality_curves": {
                        ...                },
                "content_curves": {
                       ...
                }
        },
        "read1_after_filtering": {
                "total_reads": 3,
                "total_bases": 450,
                "q20_bases": 443,
                "q30_bases": 432,
                "total_cycles": 150,
                "quality_curves": {
                    ...
                },
                "content_curves": {
                     ....
                }
        },
        "command": "fastv -i tmp_test_targets.fastq -c tmp_test_kmers.fasta -A -G -V -Q -L -j tmp_test_fastv.json -h tmp_test_fastv.html -p 0.001 -d 0.001 "

I must misunderstand something important about how fastv is operating, or there is an error. I will be grateful for any advice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant