Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF out of order or chomosome not present #13

Closed
stephenturner opened this issue Feb 14, 2021 · 5 comments
Closed

VCF out of order or chomosome not present #13

stephenturner opened this issue Feb 14, 2021 · 5 comments

Comments

@stephenturner
Copy link
Contributor

stephenturner commented Feb 14, 2021

Hello! Thanks for making this open-source and for putting together such an informative README. I'm still wrapping my brain around branching and the pedigree def file, but for now, I'm using your examples trying to simulate genetic data from a VCF file, and I'm getting an unexpected error about chromosomes in my VCF either out of order or not present.

I first created the map file exactly as specified in the README, named it the same, refined_mf.simmap. My ped def file (sims.def) file is pretty simple:

# test definition
def testped 5 3
1 1
2 1
3 1

When not using an input VCF I don't run into any issue:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv  -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            [none: no genetic data]
  Output prefix:        sims/test-gbr

  Random seed:          2550799409

  Interference file:    ../nu_p_campbell.tsv

Simulating haplotype transmissions... done.
Printing IBD segments... done.

To simulate genetic data, must use an input VCF with 20 founders.

However, when I do try to simulate variants using an input VCF, I run into a problem:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv -i gbr.22.snps.vcf.gz -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            gbr.22.snps.vcf.gz
  Output prefix:        sims/test-gbr

  Random seed:          4250646984

  Interference file:    ../nu_p_campbell.tsv

  Genotype error rate:  1.0e-03
  Opposite homozygous error rate:       0.00
  Missingness rate:     1.0e-03
  Pseudo-haploid rate:  0

  Not retaining extra samples in output VCF (printing only simulated samples)
  Output VCF will contain unphased data

Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
  Input contains 40 samples, using 20 as founders, and retaining 0
Generating VCF file... ERROR: chromosome 22 in VCF file either out of order or not present
       in genetic map

Prior to running this I checked that there were no SNPs in the VCF outside the range defined in the map. I created the VCF with minimal pre-processing starting from the GRCh37 1000 Genomes chromosome 22 VCF. I first collected sample IDs for 20 unrelated samples from GBR, then subsetted the VCF getting only those samples, only variants on chromosome 22 between the region 17152611-51175626, and the -v snps -m2 -M2 -i 'INFO/AF>0.05' gets only biallelic SNPs with a global MAF>0.05, then finally sorts the output and writes a new VCF.

bcftools view ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    --regions 22:17152611-51175626 \
    --samples 40-gbr.txt \
    -v snps -m2 -M2 -i 'INFO/AF>0.05' \
    | bcftools sort -Oz -o gbr.22.snps.vcf.gz

Here's that VCF in case you want it for a reproducible example:

gbr.22.snps.vcf.gz

Looking at the first few and last few chromosome 22 entries on the sex-specific map:

$ grep ^22 refined_mf.simmap | sort -nk2 | head -n2
22      17087705        0       0
22      17152611        0.003573334924  0.0023823617488

$ grep ^22 refined_mf.simmap | sort -nk2 | tail -n2
22      51175626        61.01549487409  76.3046256962048
22      51229805        61.023018820178 76.3068982668672

... shows that subsetting the vcf to the region 22:17152611-51175626 shouldn't result in any variants outside the range defined by the map. Indeed, checking the VCF itself shows that the first record is at position 17152611, and the last is at position 51175626, both included in the sex-specific map (refined_mf.simmap).

$ bcftools view --no-header gbr.22.snps.vcf.gz | head -n1 | cut -f1-6
22      17152611        rs4819849       A       G       100

$ bcftools view --no-header gbr.22.snps.vcf.gz | tail -n1 | cut -f1-6
22      51175626        rs3810648       A       G       100

Thanks, and please let me know if I can help with reproducing this issue on your end!

@stephenturner
Copy link
Contributor Author

Here's the sex-specific map I'm using (compressed)

refined_mf.simmap.gz

@williamslab
Copy link
Owner

Thanks for posting! If you remove all chromosomes except 22 from the refined_mf.simmap file, it will work. Though this can change, for now, though you can omit later chromosomes, Ped-sim requires the VCF's first and successive chromosomes to match the first and later chromosomes in the genetic map. So it's looking for chromosome 1 here.

@williamslab
Copy link
Owner

Note that when using sex-specific maps, simulating all chromosomes simultaneously is necessary to ensure that the parent sexes are the same on all chromosomes, so if you can, try to merge the autosomal data into one VCF. (Support for chrX is coming hopefully in March.)

@stephenturner
Copy link
Contributor Author

Thanks for the quick reply! Removing non chr22 from that map should be an easy fix.

Regarding the second - my initial intent was to simulate only a single chromosome's worth of data, not simulating genome-wide data one chromosome at a time. That should still work, right?

@williamslab
Copy link
Owner

For sure, should work! Hope the documentation helps.

williamslab added a commit that referenced this issue Feb 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants