more informative error messages #18

@cjbattey (Collaborator) commented Oct 4, 2020

This PR adds more informative error messages for merging sample metadata. The old version worked correctly but didn't report which sample IDs were mismatched. It also had a slightly odd behavior: extra rows added to the metadata table were silently dropped when we reindexed on the genotype sample names. The new version checks that the metadata and genotype sample vectors are the same length and that all genotype IDs are in the table. Any mismatches or duplicates are printed to the screen.
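For context, here is a minimal sketch of the silent-drop behavior described above. It assumes the metadata is a pandas DataFrame indexed by sample ID and reindexed on the VCF sample names; the table contents, the `x`/`y` columns, and the sample names are made up for illustration and are not taken from the locator source.

```python
import pandas as pd

# Hypothetical metadata table with one extra row ("extra_1") that is not in the VCF.
sample_data = pd.DataFrame(
    {
        "sampleID": ["msp_1", "msp_2", "extra_1"],
        "x": [1.0, 2.0, 3.0],
        "y": [4.0, 5.0, 6.0],
    }
)
sample_data["sampleID2"] = sample_data["sampleID"]
sample_data = sample_data.set_index("sampleID")

# Sample names as they might come from the genotype file (made up here).
samples = ["msp_2", "msp_1", "msp_3"]

# reindex() conforms the table to `samples` without raising:
# rows not listed in `samples` are dropped, and IDs missing from the
# table become rows filled with NaN.
reindexed = sample_data.reindex(samples)
print(reindexed)
```

Neither the dropped `extra_1` row nor the all-NaN `msp_3` row raises an error, which is why an explicit check before reindexing is useful.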

Adding a duplicate row to sample_data now prints:

reading VCF  
[read_vcf] 11527 rows in 0.49s; chunk in 0.49s (23444 rows/s)   
[read_vcf] all done (23442 rows/s)   
error: problem merging genotypes and sample_data  
duplicate sample_data entries: ['msp_458']  

Changing one of the sample_data IDs prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: ['msp_typo_test']  

and deleting a row in sample_data prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: []  
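
Checks that produce messages like the ones above could be sketched roughly as follows. This is not the code in the diff: it is a standard-library-only illustration, and the function name `check_sample_ids` and its return convention (a list of message strings) are assumptions made for the example.

```python
from collections import Counter


def check_sample_ids(vcf_samples, metadata_ids):
    """Return a list of problem messages; an empty list means the IDs match."""
    vcf_samples = [str(s) for s in vcf_samples]
    metadata_ids = [str(s) for s in metadata_ids]
    problems = []

    # IDs that appear more than once in the metadata table.
    duplicates = sorted(s for s, n in Counter(metadata_ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate sample_data entries: {duplicates}")

    # IDs present on one side but not the other (sets give fast lookups).
    metadata_set = set(metadata_ids)
    vcf_set = set(vcf_samples)
    missing_from_metadata = [s for s in vcf_samples if s not in metadata_set]
    missing_from_vcf = [s for s in metadata_ids if s not in vcf_set]
    if missing_from_metadata or missing_from_vcf:
        problems.append(
            f"vcf samples missing from sample_data: {missing_from_metadata}"
        )
        problems.append(
            f"sample_data samples missing from vcf: {missing_from_vcf}"
        )
    return problems


# Example: one metadata ID has a typo.
for message in check_sample_ids(["msp_1", "msp_458"], ["msp_1", "msp_typo_test"]):
    print(message)
```

If the VCF sample names are unique, a length mismatch between the two vectors always shows up as either a duplicate or a missing ID, so the sketch does not compare lengths separately.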

np.unique(sample_data['sampleID2'])[np.where(sample_data_counts>1)])
return
missing_from_metadata=[x not in np.array(sample_data['sampleID2']) for x in samples]
missing_from_vcf=[x not in samples for x in sample_data['sampleID2']]
Member

I'm working on a version of locator that can run on a subset of samples within a zarr file. That might be a good thing to fall back to from here (or to offer as an option)?

def sort_samples(samples):
samples = np.array(samples.astype('str'))
Member

thumbs up
