Empty yaml file types (and incorrect formatting) #1424
Comments
@karatugo could this please be worked on along with the other yaml issues, thanks |
@jiyue1214 will check how many of these have been resolved already and also deal with the quotation marks. |
Based on the study status as of 27 September 2024:
Notably, we have not set any restriction on the value of the file_type field. |
Hi @earlEBI, I extracted the first two rows from each sumstats file, reformatted them as "header: value", and identified each file type from that. Here are the results. Could you please check whether they are correct? Hi @karatugo, since there are more than 800 studies, could you suggest any best practices for updating both the DB and the meta-yaml files? Thank you for your support and help, |
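For anyone repeating this check, here is a minimal sketch of the "header: value" preview step described above. It assumes the sumstats files are tab-separated and optionally gzipped. The function name `header_value_preview` is hypothetical and is not taken from any script in this repository. The file type decision itself is left to a human reader, since the classification rules are not spelled out in this thread.

```python
import csv
import gzip


def header_value_preview(path):
    """Return 'header: value' lines built from the first two rows of a TSV."""
    # Assumption: files ending in .gz are gzip-compressed, everything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)          # row 1: column names
        first_row = next(reader, [])   # row 2: first data row, if present
    # Pad a short data row so every header still gets a line.
    first_row = first_row + [""] * (len(header) - len(first_row))
    return [f"{name}: {value}" for name, value in zip(header, first_row)]
```

Calling `header_value_preview("GCSTxxxx.tsv.gz")` (placeholder file name) returns one string per column, which can be printed and compared by eye against the expected headers for each file type.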
@jiyue1214 file types look good to me! |
@jiyue1214 I can update the DB by writing a new script. |
@earlEBI, thank you so much for checking the format. Among the data that need to be reformatted, GCST90012644, GCST90012743, GCST90012778, GCST90012648 and GCST90012784 seem to be pre-publication data for which I cannot find the publication information. It would be very helpful if you could let me know whether A1 is the effect allele. By the way, for the non-GWAS-SSF format there are two spellings in use: non-GWAS-SSF and Non-GWAS-SSF. Which one is correct? I found this website, but I am still unclear which it should be. |
@karatugo Thank you so much for your help. Here are two issues:
|
@jiyue1214 Working on a script now that will fix missing file types according to your list. |
@jiyue1214 Re: 2 - I'll update them according to #1424 (comment), i.e.,
|
I agree, it's not consistently used in documentation! I think I would go with 'non-GWAS-SSF' as it matches the case of 'pre-GWAS-SSF' |
Thank you @earlEBI for the explanation. @karatugo, let's keep using |
For item 1.
|
For item 2.
|
Fixed the cases that were pushed through validation. They are missing GCST IDs in the meta collection and needed a more complex script, which is in the cron scripts dir for prod. |
Marked GCSTs as pending. They should be generated in 2 days. |
All file types are in the set
|
@earlEBI Do you want me to generate all yaml files to make sure that the file types are consistent? |
@karatugo Does this mean there are some file_types which are empty? I don't think that should be allowed? |
@jiyue1214 Hi again! These GCSTs are all from PMID 33441150 but were permanently unpublished from the Catalog by user request due to unclear traits. I guess those folders should be deleted from public ftp really. Can probably stay on staging but don't need reformatting now : ) |
Here's the list of GCST IDs with empty file type.
|
@karatugo I checked the first 2 have file type 'GWAS-SSF v1.0' in their yaml currently. Could you read the file type from the yaml for all of these? |
Could you just regenerate yamls for those in Yue's list PLUS any with file type currently beginning "' " which I don't think are included (eg. GCST90319314) |
Sure, I'll fix them too. |
The ones beginning "' " are all fixed now but unfortunately I didn't keep a list of them. |
File types in yamls GCST90310293 - GWAS-SSF v1.0 |
All file types are GWAS-SSF v1.0 in MongoDB as well, except for GCST90315948, where it is empty. This excludes harmonised files though. @jiyue1214 For harmonised files, what should the file type be? |
@karatugo GCST90315948 should be GWAS-SSF |
@karatugo |
fixed GCST90315948 file type as |
Yamls will be generated in 2 days for:
|
@earlEBI @Santhi1901 please confirm |
@karatugo I think these are all fixed, except for 40 harmonised yamls that still have the leading quotation mark and space: |
@karatugo Sorry, there are also 593 yamls that still have an empty file type. I think a lot of them are strange cases: harmonised yamls, old formatted yamls, etc. |
They should be available in the public ftp in 2 days. |
Hi @earlEBI. I found that most of the files in the list are f.tsv.gz files in the harmonised folder, and their raw file names do not follow the pattern GCST*.tsv. For these cases, we decided to add a yaml file for the h.tsv.gz file and keep the rest as they currently are. Thanks for the reminder; I will gather them in the ticket "Old harmonised data missing yaml files".
@karatugo I created a new ticket for new studies @earlEBI reported. Feel free to close the ticket when those 40 studies are done. |
They are not done yet, will check again on Monday. |
Checked a few of them and they are present in the public ftp. |
@karatugo Looks good! |
Empty yaml file types
Found 785 yamls with empty file types - GCSTs listed in attached .txt file:
yamls-with-empty-filetypes.txt
(Unfortunately these are not all GWAS-SSF so file headers would need to be checked to determine correct file type.)
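A list like the one attached could be rebuilt with a small scan. This is only a sketch under stated assumptions: `file_type` is a top-level key written on a single line, the yaml files sit under one directory tree, and the helper names `has_empty_file_type` and `find_empty_file_types` are made up for illustration. It uses a plain regular expression instead of a YAML parser, so it will not cope with unusual YAML layouts.

```python
import re
from pathlib import Path

# Matches a top-level 'file_type:' line; [ \t]* avoids swallowing the newline.
FILE_TYPE_LINE = re.compile(r"^file_type:[ \t]*(.*)$", re.MULTILINE)


def has_empty_file_type(yaml_text):
    """True if file_type is missing, blank, or only quotes and whitespace."""
    match = FILE_TYPE_LINE.search(yaml_text)
    if match is None:
        return True  # assumption: a missing key counts as empty
    value = match.group(1).strip().strip("'\"").strip()
    return value == ""


def find_empty_file_types(root):
    """Return sorted paths of *.yaml files under root with an empty file_type."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*.yaml")
        if has_empty_file_type(path.read_text())
    )
```

Running `find_empty_file_types` on a local copy of the yaml directory gives the paths to review. The correct value for each still has to come from checking the file headers, as noted above.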
Incorrect file_type formatting
Also found several yamls with file_type ' GWAS-SSFv1.0' or ' GWAS-SSF v1.0' (with single quotation marks and leading whitespace, e.g. GCST90319314). The quotation marks and whitespace should be removed so it reads e.g. file_type: GWAS-SSFv1.0.
(Usage also varies between 'GWAS-SSFv1.0' and 'GWAS-SSF v1.0', with an added space. Could this also be cleaned up?)
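The cleanup requested here could look roughly like the sketch below. It makes two assumptions drawn from the comments in this thread rather than from any documented rule: the spaced form 'GWAS-SSF v1.0' is treated as canonical because that is the value reported in the yamls and the DB after the fix, and a lowercase prefix is used for 'non-GWAS-SSF' to match 'pre-GWAS-SSF'. The function name `normalize_file_type` is hypothetical.

```python
import re


def normalize_file_type(raw):
    """Clean a file_type value: stray quotes, whitespace, prefix case, version spacing."""
    # Drop outer whitespace, then stray quote characters, then the
    # whitespace that was hidden inside the quotes.
    cleaned = raw.strip().strip("'\"").strip()
    # Lowercase a capitalised 'Non-' or 'Pre-' prefix.
    cleaned = re.sub(r"^(Non|Pre)-", lambda m: m.group(1).lower() + "-", cleaned)
    # Put exactly one space between 'SSF' and the version tag.
    return re.sub(r"(?<=SSF)\s*(?=v\d)", " ", cleaned)


for value in ["' GWAS-SSFv1.0'", "' GWAS-SSF v1.0'", "Non-GWAS-SSF"]:
    print(repr(value), "->", repr(normalize_file_type(value)))
```

Values that are already clean pass through unchanged, so the function can be applied to every yaml and DB record without first filtering for the broken ones.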