more informative error messages #18

cjbattey · 2020-10-04T21:09:47Z

This PR adds more informative error messages for merging sample metadata. The old version worked correctly but didn't report which sample IDs were mismatched. It also had a slightly odd behavior: extra rows added to the metadata table were silently dropped when we reindexed on the genotype sample names. The new version checks that the metadata and genotype sample vectors are the same length and that all genotype IDs are in the table. Any mismatches or duplicates are printed to the screen.

Adding a duplicate row to sample_data now prints:

reading VCF  
[read_vcf] 11527 rows in 0.49s; chunk in 0.49s (23444 rows/s)   
[read_vcf] all done (23442 rows/s)   
error: problem merging genotypes and sample_data  
duplicate sample_data entries: ['msp_458']

Changing one of the sample_data IDs prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: ['msp_typo_test']

and deleting a row in sample_data prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: []
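
For readers who want to see the logic behind the three cases above, here is a minimal sketch of the checks. It is not the code in scripts/locator.py: the helper name `check_sample_ids` is hypothetical, and it takes plain Python lists rather than the `sample_data['sampleID2']` column and the genotype sample array, so it runs without numpy or pandas. It returns the messages instead of printing them.

```python
from collections import Counter


def check_sample_ids(vcf_samples, metadata_ids):
    """Return a list of problem messages; an empty list means the IDs line up.

    Hypothetical helper for illustration only, not the function in the PR.
    """
    # Compare everything as strings so numeric-looking IDs still match.
    vcf_samples = [str(s) for s in vcf_samples]
    metadata_ids = [str(s) for s in metadata_ids]

    # Duplicated metadata rows are reported on their own, before other checks.
    counts = Counter(metadata_ids)
    duplicates = sorted(s for s, n in counts.items() if n > 1)
    if duplicates:
        return [f"duplicate sample_data entries: {duplicates}"]

    # Collect the actual IDs missing on each side, in input order.
    metadata_set = set(metadata_ids)
    vcf_set = set(vcf_samples)
    missing_from_metadata = [s for s in vcf_samples if s not in metadata_set]
    missing_from_vcf = [s for s in metadata_ids if s not in vcf_set]

    lengths_differ = len(vcf_samples) != len(metadata_ids)
    if missing_from_metadata or missing_from_vcf or lengths_differ:
        return [
            f"vcf samples missing from sample_data: {missing_from_metadata}",
            f"sample_data samples missing from vcf: {missing_from_vcf}",
        ]
    return []


# Example: one metadata ID was mistyped.
for line in check_sample_ids(["msp_1", "msp_458"], ["msp_1", "msp_typo_test"]):
    print(line)
```

Returning the messages rather than printing inside the helper is a choice made here only to keep the sketch easy to test; the PR itself prints to screen.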

clararehmann · 2020-10-07T19:07:51Z

scripts/locator.py

+              np.unique(sample_data['sampleID2'])[np.where(sample_data_counts>1)])
+        return
+    missing_from_metadata=[x not in np.array(sample_data['sampleID2']) for x in samples]
+    missing_from_vcf=[x not in samples for x in sample_data['sampleID2']]


I'm working on a version of locator that can run on a subset of samples within a zarr file - that might be good to bounce to from here (or make that an option)?

clararehmann · 2020-10-07T19:09:02Z

scripts/locator.py

 def sort_samples(samples):
+    samples = np.array(samples.astype('str'))


more informative error messages

6f27d9f

cjbattey assigned clararehmann Oct 4, 2020

cjbattey requested a review from clararehmann October 4, 2020 21:10

cjbattey unassigned clararehmann Oct 4, 2020

clararehmann reviewed Oct 7, 2020

thumbs up
