Empty yaml file types (and incorrect formatting) #1424

Open
earlEBI opened this issue Sep 11, 2024 · 40 comments

@earlEBI

earlEBI commented Sep 11, 2024

Empty yaml file types
Found 785 yamls with empty file types - GCSTs listed in attached .txt file:
yamls-with-empty-filetypes.txt
(Unfortunately these are not all GWAS-SSF so file headers would need to be checked to determine correct file type.)

Incorrect file_type formatting
Also found several yamls with file_type ' GWAS-SSFv1.0' or ' GWAS-SSF v1.0' (with single quotation marks and leading whitespace, e.g. GCST90319314). The quotation marks and whitespace should be removed so it reads e.g. file_type: GWAS-SSFv1.0.
(There is some variability in the usage of 'GWAS-SSFv1.0' and 'GWAS-SSF v1.0' with an added space. Could this also be cleaned up?)
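
A minimal sketch of how the two problems above could be detected in a yaml's text. It assumes the metadata has a top-level `file_type:` key on a single line; `classify_file_type` and its labels are hypothetical names, not part of any existing tool.

```python
import re

# Matches a top-level "file_type:" line and captures the raw text after the colon.
FILE_TYPE_LINE = re.compile(r"^file_type:[ \t]*(.*)$", re.MULTILINE)


def classify_file_type(yaml_text: str) -> str:
    """Return 'missing', 'empty', 'quoted' or 'ok' for one metadata yaml string."""
    match = FILE_TYPE_LINE.search(yaml_text)
    if match is None:
        return "missing"  # no file_type key at all
    raw = match.group(1).strip()
    if raw in ("", "''", '""'):
        return "empty"  # key present but no usable value
    if raw[0] in "'\"":
        return "quoted"  # e.g. ' GWAS-SSF v1.0' with stray quotation marks
    return "ok"
```

Running this over each yaml and grouping GCST IDs by label would give lists comparable to the attached .txt file.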

@karatugo karatugo self-assigned this Sep 17, 2024
@ljwh2
Contributor

ljwh2 commented Sep 19, 2024

@karatugo could this please be worked on along with the other yaml issues, thanks

@jiyue1214 jiyue1214 self-assigned this Sep 25, 2024
@ljwh2
Contributor

ljwh2 commented Sep 25, 2024

@jiyue1214 will check how many of these have been resolved already and also deal with the quotation marks.

@jiyue1214

Based on the status of the studies on 27th September 2024:

| Number of studies | file_type value in YAML file |
| --- | --- |
| 803 | '' |
| 912 | 'GWAS-SSFv1.0' |
| 35,823 | GWAS-SSFv1.0 |
| 16,042 | non-GWAS-SSF |
| 1 | 'Non-GWAS-SSF' |
| 890 | Non-GWAS-SSF |
| 56,677 | pre-GWAS-SSF |

Notably, we have not placed any constraint on the value of the file_type field.
The harmonisation queue script determines the file type by checking whether the file_type value starts with GWAS-SSF or pre-GWAS-SSF. If it starts with neither, the file type is set to "not_harm" automatically.
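
The prefix check described above can be sketched roughly as follows. This is not the harmonisation queue script itself: the function name and the two non-fallback labels are assumptions, and only "not_harm" comes from the description.

```python
def harmonisation_category(file_type: str) -> str:
    """Classify a file_type value by prefix, falling back to 'not_harm'."""
    if file_type.startswith("pre-GWAS-SSF"):
        return "pre-GWAS-SSF"
    if file_type.startswith("GWAS-SSF"):
        return "GWAS-SSF"
    # Anything else, including empty values, a leading space or a leading
    # quotation mark, falls through to the default.
    return "not_harm"
```

With a check like this, a value with a leading space such as " GWAS-SSF v1.0" would end up as "not_harm", which is why the stray quotation marks and whitespace matter.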

@jiyue1214

Hi @earlEBI, I extracted the first two rows from each sumstats file, reformatted them into "header: value" pairs, and identified each file type based on them. Here are the results. Could you please help me check whether they are correct?
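
For reference, the "header: value" extraction could look something like the sketch below. It assumes tab-separated files whose first row is the header; `header_value_pairs` is a hypothetical helper, and the rules used to decide the file type from the pairs are not shown because they are not described here.

```python
import csv
from typing import Iterable


def header_value_pairs(lines: Iterable[str], delimiter: str = "\t") -> list[str]:
    """Pair each column name with the value from the first data row."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)     # first row: column names
    first_row = next(reader)  # second row: first record
    return [f"{name}: {value}" for name, value in zip(header, first_row)]
```

For a gzipped file, the helper could be fed an open text handle, for example `with gzip.open(path, "rt") as handle: header_value_pairs(handle)`.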

Hi, @karatugo. Since there are more than 800 studies, could you please suggest any best practices for updating both the DB and meta-yaml files?

Thank you for your support and help,

@earlEBI
Author

earlEBI commented Oct 3, 2024

@jiyue1214 file types look good to me!
Just checking, did you pick up file types starting with a quotation mark and space, e.g. GCST90319314?
Also, for the note column re: reformatting, I would check that A1 is the effect allele from the readme etc. (you might have already done so!)

@karatugo
Member

karatugo commented Oct 7, 2024

@jiyue1214 I can update the DB by writing a new script.

@jiyue1214

jiyue1214 commented Oct 7, 2024

@earlEBI, thank you so much for checking the format. For the data that need to be reformatted, GCST90012644, GCST90012743, GCST90012778, GCST90012648 and GCST90012784 seem to be pre-publication data for which I cannot find the publication information. It would be super helpful if you could let me know whether A1 is the effect allele.

By the way, there are two ways of writing the non-GWAS-SSF format: non-GWAS-SSF or Non-GWAS-SSF. Could you tell me which one is correct? I found this website, but I am still unclear which it should be.

@jiyue1214

jiyue1214 commented Oct 7, 2024

@karatugo Thank you so much for your help. Here are two issues:

  1. Missing file_type: I list the correct file_type and GCST in the result file.
  2. File types starting with a quotation mark and space, like 'GWAS-SSF v1.0', from which the quotation marks need to be removed. To correct these, do you need the GCST IDs, or can your database capture them? The correct value should be GWAS-SSF v1.0, with a space in the middle.

Thanks again!

@karatugo
Member

karatugo commented Oct 7, 2024

@jiyue1214 Working on a script now that will fix missing file types according to your list.

@karatugo
Member

karatugo commented Oct 7, 2024

@jiyue1214 Re: 2 - I'll update them according to #1424 (comment), i.e.,

  • {''} -> empty string without quotation marks
  • {'GWAS-SSFv1.0',' GWAS-SSFv1.0', ' GWAS-SSF v1.0'} -> GWAS-SSF v1.0
  • {'Non-GWAS-SSF', 'Non GWAS-SSF', Non-GWAS-SSF} -> non-GWAS-SSF
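
The mapping above could be applied to raw values with something like this sketch. The function and table names are hypothetical, and passing unknown values through unchanged is an assumption rather than agreed behaviour.

```python
# Variants listed in this thread, keyed by their text after quotes and
# surrounding whitespace have been stripped.
CANONICAL = {
    "GWAS-SSFv1.0": "GWAS-SSF v1.0",
    "GWAS-SSF v1.0": "GWAS-SSF v1.0",
    "Non-GWAS-SSF": "non-GWAS-SSF",
    "Non GWAS-SSF": "non-GWAS-SSF",
    "non-GWAS-SSF": "non-GWAS-SSF",
    "pre-GWAS-SSF": "pre-GWAS-SSF",
}


def normalise_file_type(raw: str) -> str:
    """Strip stray quotation marks and whitespace, then map to the agreed spelling."""
    cleaned = raw.strip().strip("'\"").strip()
    if cleaned == "":
        return ""  # '' becomes an empty string without quotation marks
    # Unknown values are returned as cleaned so they can be reviewed by hand.
    return CANONICAL.get(cleaned, cleaned)
```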

@earlEBI
Author

earlEBI commented Oct 7, 2024

> @earlEBI, thank you so much for checking the format. For the data that need to be reformatted, GCST90012644, GCST90012743, GCST90012778, GCST90012648 and GCST90012784 seem to be pre-publication data for which I cannot find the publication information. It would be super helpful if you could let me know whether A1 is the effect allele.
>
> By the way, there are two ways of writing the non-GWAS-SSF format: non-GWAS-SSF or Non-GWAS-SSF. Could you tell me which one is correct? I found this website, but I am still unclear which it should be.

I agree, it's not consistently used in documentation! I think I would go with 'non-GWAS-SSF' as it matches the case of 'pre-GWAS-SSF'

@jiyue1214

jiyue1214 commented Oct 7, 2024

Thank you @earlEBI for the explanation. @karatugo, let's keep using GWAS-SSF v1.0, pre-GWAS-SSF and non-GWAS-SSF. I added a constraint to the metadata schema in gwas-sumstat-tools.
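
As an illustration only, a constraint restricting file_type to the three agreed values might be expressed like this. It is not the gwas-sumstat-tools schema; `FileType` and `validate_file_type` are hypothetical names.

```python
from enum import Enum


class FileType(str, Enum):
    """The three file_type values agreed in this thread."""
    GWAS_SSF_V1_0 = "GWAS-SSF v1.0"
    PRE_GWAS_SSF = "pre-GWAS-SSF"
    NON_GWAS_SSF = "non-GWAS-SSF"


def validate_file_type(value: str) -> FileType:
    """Return the matching member, or raise ValueError listing the allowed values."""
    try:
        return FileType(value)
    except ValueError:
        allowed = ", ".join(repr(member.value) for member in FileType)
        raise ValueError(f"file_type must be one of {allowed}; got {value!r}") from None
```

Under this sketch, values such as Non-GWAS-SSF or an empty string would be rejected at validation time instead of reaching the yaml.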

@karatugo
Member

karatugo commented Oct 7, 2024

@karatugo

For item 1.

  • Run the update script again, because it failed due to the Hinxton software update
    • Fix the cases pushed through validation - they are missing GCST IDs in the meta collection, and need a more complex script
  • Mark GCSTs as pending

@karatugo
Member

karatugo commented Oct 7, 2024

@karatugo

For item 2.

  • Make file types consistent for all documents in the collection as agreed above
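
A rough sketch of how the consistency update could be prepared as MongoDB filter and update documents. The field name `file_type` in the collection and whether the quotation marks are literally stored are assumptions, and `build_updates` is a hypothetical helper, not the script that was run.

```python
# Illustrative variant lists taken from the mapping agreed earlier in this thread.
VARIANTS = {
    "GWAS-SSF v1.0": ["'GWAS-SSFv1.0'", "' GWAS-SSFv1.0'", "' GWAS-SSF v1.0'"],
    "non-GWAS-SSF": ["'Non-GWAS-SSF'", "'Non GWAS-SSF'", "Non-GWAS-SSF"],
}


def build_updates(variants_by_canonical: dict[str, list[str]]) -> list[tuple[dict, dict]]:
    """Build one (filter, update) pair per canonical file type."""
    updates = []
    for canonical, variants in variants_by_canonical.items():
        filter_doc = {"file_type": {"$in": variants}}    # documents holding any variant
        update_doc = {"$set": {"file_type": canonical}}  # rewrite to the agreed value
        updates.append((filter_doc, update_doc))
    return updates
```

Each pair could then be passed to a call such as pymongo's `Collection.update_many(filter, update)`.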

@karatugo
Member

karatugo commented Oct 9, 2024

Fixed the cases pushed through validation. They were missing GCST IDs in the meta collection and needed a more complex script. The script is in the cron scripts directory for prod.

@karatugo
Member

karatugo commented Oct 9, 2024

Marked GCSTs as pending. They should be generated in 2 days.

@karatugo
Member

karatugo commented Oct 9, 2024

All file types are in the set

`` 
`GWAS-SSF v1.0`
`non-GWAS-SSF`
`pre-GWAS-SSF`

@karatugo
Member

karatugo commented Oct 9, 2024

@earlEBI Do you want me to generate all yaml files to make sure that the file types are consistent?

@earlEBI
Author

earlEBI commented Oct 10, 2024

> All file types are in the set
>
> ``
> `GWAS-SSF v1.0`
> `non-GWAS-SSF`
> `pre-GWAS-SSF`

@karatugo Does this mean there are some file_types which are empty? I don't think that should be allowed?

@earlEBI
Author

earlEBI commented Oct 10, 2024

> @earlEBI, thank you so much for checking the format. For the data that need to be reformatted, GCST90012644, GCST90012743, GCST90012778, GCST90012648 and GCST90012784 seem to be pre-publication data for which I cannot find the publication information. It would be super helpful if you could let me know whether A1 is the effect allele.
>
> By the way, there are two ways of writing the non-GWAS-SSF format: non-GWAS-SSF or Non-GWAS-SSF. Could you tell me which one is correct? I found this website, but I am still unclear which it should be.

@jiyue1214 Hi again! These GCSTs are all from PMID 33441150 but were permanently unpublished from the Catalog by user request due to unclear traits. I guess those folders should be deleted from public ftp really. Can probably stay on staging but don't need reformatting now : )

@karatugo
Member

Here's the list of GCST IDs with empty file type.

GCST90310293
GCST90315948
GCST90315948
GCST90319473
GCST90319474
GCST90319475
GCST90319476
GCST90319477
GCST90319478
GCST90319479
GCST90319480
GCST90319481
GCST90319482
GCST90319483
GCST90319484
GCST90319485
GCST90319486

@earlEBI
Author

earlEBI commented Oct 10, 2024

> Here's the list of GCST IDs with empty file type.
>
> GCST90310293
> GCST90315948
> GCST90315948
> GCST90319473
> GCST90319474
> GCST90319475
> GCST90319476
> GCST90319477
> GCST90319478
> GCST90319479
> GCST90319480
> GCST90319481
> GCST90319482
> GCST90319483
> GCST90319484
> GCST90319485
> GCST90319486

@karatugo I checked the first 2 have file type 'GWAS-SSF v1.0' in their yaml currently. Could you read the file type from the yaml for all of these?

@earlEBI
Author

earlEBI commented Oct 10, 2024

> @earlEBI Do you want me to generate all yaml files to make sure that the file types are consistent?

Could you just regenerate yamls for those in Yue's list PLUS any with file type currently beginning "' " which I don't think are included (eg. GCST90319314)

@karatugo
Member

> > Here's the list of GCST IDs with empty file type.
> >
> > GCST90310293
> > GCST90315948
> > GCST90315948
> > GCST90319473
> > GCST90319474
> > GCST90319475
> > GCST90319476
> > GCST90319477
> > GCST90319478
> > GCST90319479
> > GCST90319480
> > GCST90319481
> > GCST90319482
> > GCST90319483
> > GCST90319484
> > GCST90319485
> > GCST90319486
>
> @karatugo I checked the first 2 have file type 'GWAS-SSF v1.0' in their yaml currently. Could you read the file type from the yaml for all of these?

Sure, I'll fix them too.

@karatugo
Member

> > @earlEBI Do you want me to generate all yaml files to make sure that the file types are consistent?
>
> Could you just regenerate yamls for those in Yue's list PLUS any with file type currently beginning "' " which I don't think are included (eg. GCST90319314)

The ones beginning "' " are all fixed now but unfortunately I didn't keep a list of them.

@karatugo
Member

karatugo commented Oct 10, 2024

File types in yamls

GCST90310293 - GWAS-SSF v1.0
GCST90315948 - yaml not generated yet
GCST90319473 - GWAS-SSF v1.0
GCST90319474 - GWAS-SSF v1.0
GCST90319475 - GWAS-SSF v1.0
GCST90319476 - GWAS-SSF v1.0
GCST90319477 - GWAS-SSF v1.0
GCST90319478 - GWAS-SSF v1.0
GCST90319479 - GWAS-SSF v1.0
GCST90319480 - GWAS-SSF v1.0
GCST90319481 - GWAS-SSF v1.0
GCST90319482 - GWAS-SSF v1.0
GCST90319483 - GWAS-SSF v1.0
GCST90319484 - GWAS-SSF v1.0
GCST90319485 - GWAS-SSF v1.0
GCST90319486 - GWAS-SSF v1.0

@karatugo
Member

karatugo commented Oct 10, 2024

> GCST90310293 - GWAS-SSF v1.0
> GCST90315948 - yaml not generated yet
> GCST90319473 - GWAS-SSF v1.0
> GCST90319474 - GWAS-SSF v1.0
> GCST90319475 - GWAS-SSF v1.0
> GCST90319476 - GWAS-SSF v1.0
> GCST90319477 - GWAS-SSF v1.0
> GCST90319478 - GWAS-SSF v1.0
> GCST90319479 - GWAS-SSF v1.0
> GCST90319480 - GWAS-SSF v1.0
> GCST90319481 - GWAS-SSF v1.0
> GCST90319482 - GWAS-SSF v1.0
> GCST90319483 - GWAS-SSF v1.0
> GCST90319484 - GWAS-SSF v1.0
> GCST90319485 - GWAS-SSF v1.0
> GCST90319486 - GWAS-SSF v1.0

All file types are GWAS-SSF v1.0 in Mongo DB as well, except for GCST90315948, where it is empty. This excludes harmonised files though.

@jiyue1214 For harmonised files, what should their file type be?

@earlEBI
Author

earlEBI commented Oct 11, 2024

@karatugo GCST90315948 should be GWAS-SSF

@earlEBI
Author

earlEBI commented Oct 11, 2024

@karatugo
I think there are 2,482 GCSTs with yaml file_type: ' GWAS-SSF v1.0'
Listed in this .txt file:
' GWAS-SSF file_types.txt

@karatugo
Member

Fixed the GCST90315948 file type; it is now GWAS-SSF v1.0.

@karatugo
Member

Yamls will be generated in 2 days for:

@ljwh2
Contributor

ljwh2 commented Oct 23, 2024

@earlEBI @Santhi1901 please confirm

@earlEBI
Author

earlEBI commented Oct 28, 2024

@karatugo I think these are all fixed, except for 40 harmonised yamls that still have the leading quotation mark and space:
yaml_harm_file_type_errors.txt

@earlEBI
Author

earlEBI commented Oct 28, 2024

@karatugo Sorry, there are also 593 yamls that still have an empty file type. I think a lot of them are strange cases: harmonised yamls, old formatted yamls, etc.
Might need some more header analysis to determine the file_types?
empty-yamls-28.10.24.txt

@karatugo
Member

> @karatugo I think these are all fixed, except for 40 harmonised yamls that still have the leading quotation mark and space: yaml_harm_file_type_errors.txt

They should be available in the public ftp in 2 days.

@jiyue1214

jiyue1214 commented Oct 30, 2024

Hi, @earlEBI. I found that most of the files in the list are f.tsv.gz files in the harmonised folder, and their raw file names did not follow the pattern GCST*.tsv. For this situation, we decided to add a yaml file for the h.tsv.gz file and keep the rest as they currently are.

Thanks for reminding me. I will gather them in the ticket "Old harmonised data missing yaml files".

@jiyue1214

@karatugo I created a new ticket for the new studies @earlEBI reported. Feel free to close this ticket when those 40 studies are done.

@karatugo
Member

They are not done yet, will check again on Monday.

@karatugo
Member

karatugo commented Nov 5, 2024

> @karatugo I think these are all fixed, except for 40 harmonised yamls that still have the leading quotation mark and space: yaml_harm_file_type_errors.txt

Checked a few of them and they are present in the public ftp.

@earlEBI
Author

earlEBI commented Nov 5, 2024

@karatugo Looks good!
